Foreword


I am thankful to the editors for sending me a prerelease copy of their book and for 
their kind invitation to write a foreword. I have greatly benefitted by reading the 
book and learning fresh ideas shared by distinguished experts from academia as well 
as professionals with serious engagement in the development of new generations of 
automobiles and driver assistance systems. 

I have been fortunate to enjoy a long academic career pursuing a research agenda
involving a number of topics discussed in the book. These include computer vision, 
intelligent vehicles and systems, machine learning, and human-robot interactions. 
For the past two decades, my team and collaborators have focused our explorations 
on developing human-centered, safe intelligent vehicles and transportation systems. 
Therefore, I am especially pleased to read this book presenting the latest developments,
findings, and insights in automated driving with “towards safety” as a highlighted term
in the title and as a thread binding various chapters in the book. Emphasis on safety
as an essential requirement was not always the case in the past. Reports of serious 
accidents involving automobiles with inadequate capabilities for automated driving 
are making safety an uncompromising requirement. 

However, progress in the field does not occur in a consistent, predictable manner
or at a predictable rate. In the early phase, spanning around 1980-2000, autonomous vehicle research
activities were mostly limited to a handful of universities and industrial labs. Impor- 
tant insights and milestones realized during this phase were novel perception and 
control systems for lane detection, cruise control, and carefully designed demon- 
strations of “autonomous” highway driving. However, interest and sponsorship of 
research in the field diminished and real-world deployment was unrealized. 

In the next phase of autonomous vehicles research, spanning about mid-2000 
to 2015, key enabling technologies, such as Global Positioning Systems and cost- 
effective mass production of sensors (radars, cameras) and embedded processors, 
not only invigorated serious activities in academic and technical communities, but 
also sparked interest and commitments by venture and commercial sectors. 

It seemed that this time, our field was on an irreversible forward trajectory. Impor- 
tant insights and milestones realized during this phase include the introduction of 
active safety systems like dynamic stability control, and practical driver assistance 
systems, such as lane keep assistance, collision avoidance, pedestrian protection,
panoramic viewing, and other camera-based enhancements. We also started to take 
note of the great strides made in the machine learning and computer vision fields. 
Some of the major limitations encountered in the autonomous vehicles field, such 
as “brittle” object detectors trained using traditional, hand-crafted features, seemed 
resolvable with new developments in machine learning and computer vision. 

Topics covered in this book are from the latest phase (from 2015 onwards) of 
autonomous vehicle research activities. Authors clearly acknowledge the impor- 
tant and essential roles of deep neural networks (DNN) and data-driven intelligent 
systems in making new contributions. They also recognize that there are unresolved 
challenges lying ahead. More focused research of both analytical and experimental
nature is required. As a long-term proponent of safe, human-centered intelligent 
vehicles, I appreciate that the book places prominent emphasis on safety, integrated 
systems, robustness, performance evaluation, experimental metrics, benchmarks, and 
protocols. This makes the book unique and essential reading for students and
scholars from multiple disciplines including engineering, computer/data/cognitive 
sciences, artificial intelligence, machine learning, and human/machine systems. 

This is a very important book, as it deals with a topic surrounded with great excite- 
ment, not only in the technical communities, but also in the minds of people at large. 
With great excitement and wide popular interest in the autonomous driving field,
there is often a temptation to overlook or ignore unresolved technical challenges.
Given the grave consequences of failures of complex systems deployed in the real world
inhabited by humans, we designers, engineers, and promoters of autonomous driving
systems need to be most careful and attentive to the topics presented in the book.
The editors and authors have done an outstanding job in presenting important themes 
and carefully selected topics in a systematic, insightful manner. The book identifies 
promising avenues for focused research and development, as well as acknowledges 
outstanding challenges yet to be resolved. 

I congratulate Prof. Tim Fingscheidt, Prof. Hanno Gottschalk, and Prof. Sebastian 
Houben, for their vision and efforts in organizing an outstanding team of authors 
with deep academic and industrial backgrounds in developing this special book. I 
am confident that it will prove to be of great value in realizing a new generation of
safer, smoother, trustworthy automobiles.


La Jolla, CA, USA Mohan M. Trivedi 


Mohan M. Trivedi, Distinguished Professor of Engineering, University of California, San Diego;
founding director of the Computer Vision and Robotics Research (CVRR) and the Laboratory for
Intelligent and Safe Automobiles (LISA). Fellow of IEEE, SPIE, and IAPR.


Preface 


This book addresses readers from both academia and industry, since it is written by 
authors from both academia and industry. Accordingly, it takes on diverse viewing 
angles, but keeps a clear focus on machine-learned environment perception in 
autonomous vehicles. Special interest is on deep neural networks themselves, their 
robustness and uncertainty awareness, the data they are trained on, and, last but not 
least, on safety aspects. 

The book is also special in its structure. While a first part is an extensive survey 
of literature in the field, the second part consists of 14 chapters, each detailing a 
particular aspect in the area of interest. This book wouldn’t exist without the large- 
scale national flagship project KI Absicherung (safe AI for automated driving), 
with academia, car manufacturers, and suppliers being involved to work towards 
robustness and safety arguments in machine-learned environment perception for 
autonomous vehicles. I came up with the idea of a book project when I saw a first 
draft of a wonderful survey that was prepared by Sebastian Houben in collaboration 
with many partners from the project. My immediate thought was that this should 
not only be some project deliverable, it should become the core of a book (namely, 
Part I), appended with a number of novel contributions on specific topics in the field 
(namely, Part II). Beyond Sebastian Houben, I am happy that Hanno Gottschalk 
agreed to join the editorial team with his expertise in uncertainty and safety aspects. 
We thank KI Absicherung project leader (PL) Stephan Scholz (Volkswagen), Michael 
Mock (Scientific PL, Fraunhofer IAIS), and Fabian Hüger (Part PL, Volkswagen)
very much, since without their immediate and ongoing support this book project 
wouldn’t have come alive. We also thank the many reviewers for the time and energy 
they put into providing valuable feedback on the initial versions of the book chapters. 

But the book reaches beyond the project. There are several contributions from 
outside the project enriching the diversity of views. Aiming at increasing impact, we 
editors decided to make the book an open-access publication. The funds for the open- 
access publication have been covered by the German Federal Ministry for Economic 
Affairs and Energy (BMWi) in the framework of the “Safe AI for Automated Driving” 
research consortium. Beyond that we are very grateful for the financial support both 
from Bergische Universität Wuppertal and Technische Universität Braunschweig.
A very special thanks goes to Andreas Bär, TU Braunschweig, for his endless
efforts in running the editorial office, for any extra proofreading, formatting, and for
keeping close contact with all the authors.

Tim Fingscheidt, in the name of the editorial team, jointly with Hanno Gottschalk 
and Sebastian Houben 


Braunschweig, Germany Tim Fingscheidt 
Wuppertal, Germany Hanno Gottschalk 


About This Book


An autonomous vehicle must be able to perceive its environment and react 
adequately to it. In highly automated vehicles, camera-based perception is increas- 
ingly performed by artificial intelligence (AI). One of the greatest challenges in 
integrating these technologies in highly automated vehicles is to ensure the func- 
tional safety of the system. Existing and established safety processes cannot simply 
be transferred to machine learning methods. 

In the national project KI Absicherung, which is funded by the German
Federal Ministry for Economic Affairs and Energy (BMWi), 24 partners from
industry, academia, and research institutions collaborate to set up a stringent safety
argumentation, with which AI-based function modules (AI modules) can be secured
and validated for highly automated driving. The project brings together experts 
from the fields of artificial intelligence and machine learning, functional safety, and 
synthetic sensor data generation. 

The topics and focus of this book have a large overlap with the field of research 
being tackled in subproject three of the KI Absicherung project, in which methods 
and measures are developed to identify and reduce inherent insufficiencies of the AI 
modules. We consider such methods and measures to be key elements in supporting 
the general safeguarding of AI modules in a car.

We gratefully thank the editors of this book as well as Springer Verlag for making 
this research accessible to a broad audience and wish the reader interesting insights 
and new research ideas. 

Stephan Scholz, Michael Mock, and Fabian Hüger


Introduction 


Environment perception for driver assistance systems, highly automated driving, and 
autonomous driving has seen the same disruptive technology paradigm shift as many 
other disciplines: machine learning approaches and along with them deep neural 
networks (DNNs) have taken over and oftentimes replaced classical algorithms from 
signal processing, tracking, and control theory. 

Along with this, many questions arise. Of fundamental interest is the question of
how safe an artificial intelligence (AI) system actually is. Part I of this book provides
an extensive literature survey introducing various relevant aspects to the reader. It 
not only provides guidance on how to inspect and understand deep neural networks, 
but also on how to overcome robustness issues or safety problems. Part I will also 
provide motivations and links to Part II of the book, which features novel approaches 
in various aspects of deep neural networks in the context of environment perception, 
such as their robustness and uncertainty awareness, the data they are trained on, and,
last but not least, safety once again.

It all starts with choosing the right data from the right set of sensors, and, very
importantly, the right amount of training data. Chapter “Does Redundancy in AI
Perception Systems Help to Test for Super-Human Automated Driving Perfor- 
mance?” (p. 81) will provide insights into the problem of obtaining statistically 
sufficient test data and will discuss this under the assumption of redundant and inde- 
pendent (sensor) systems. The reader will learn about the curse of correlation in 
error occurrences of redundant sub-systems. A second aspect of dataset optimiza- 
tion is covered in Chapter “Analysis and Comparison of Datasets by Leveraging 
Data Distributions in Latent Spaces” (p. 107). Investigating variational autoencoder 
latent space data distributions, it will present a new technique to assess the dissimi- 
larity of the domains of different datasets. Knowledge about this will help in mixing 
diverse datasets in DNN training, an important aspect to keep both training efficiency 
and generalization at a high level. When training or inferring on real data, it is
important to detect outliers and, potentially, to learn hitherto unknown objects. Chapter
“Detecting and Learning the Unknown in Semantic Segmentation” (p. 277) presents 
a method to detect anomalous objects by statistically analyzing entropy responses in 
semantic segmentation to suggest new semantic categories. Such a method can also be
used to pre-select diverse datasets for training. Chapter “Optimized Data Synthesis
for DNN Training and Validation by Sensor Artifact Simulation” (p. 127) follows 
the path of training with synthetic data, which has high potential, simply because 
rich ground truth is available for supervised learning even on sequence data. The 
authors are concerned with intelligent strategies of augmentation and use an appro- 
priate metric to better match synthetic and real data, thereby increasing performance 
of synthetically trained networks. 

A second focus area of this book is robustness. Robustness comprises the ability 
of a DNN to keep high performance also under adverse conditions such as varying 
weather and light conditions, dirt on the sensors, any kind of domain shift, or even 
under adversarial attacks. In Chapter “Improved DNN Robustness by Multi-task 
Training with an Auxiliary Self-Supervised Task” (p. 149), it is shown how intel- 
ligent multi-task training can be employed to strengthen a semantic segmentation 
DNN against various noise types and against adversarial attacks. An important aspect 
is that the supporting task of depth estimation does not require extra ground truth 
labels. Among the adversarial attacks, the universal ones are particularly dangerous 
as they can be applied to any current sensor input. Chapter “Improving Transfer- 
ability of Generated Universal Adversarial Perturbations for Image Classification 
and Segmentation” (p. 171) addresses a way to generate these dangerous attacks both
for semantic segmentation and for classification tasks; such attacks should not be missing from
any test suite for environment perception functions. Obtaining compact and efficient
networks by compression techniques typically comes along with a degradation in 
performance. Not so in Chapter “Joint Optimization for DNN Model Compression 
and Corruption Robustness” (p. 405): The authors present a joint pruning and quan- 
tization framework which yields more robust networks against various corruptions. 
Mixture-of-experts architectures for DNN aggregation can be a suitable means for
improved robustness but also interpretability. Chapter “Evaluating Mixture-of-Ex- 
perts Architectures for Network Aggregation” (p. 315) also shows that the models 
deal favorably with out-of-distribution (OoD) objects. 

Uncertainty and interpretability form the third focus area of the book. Both concepts
are fundamental towards self-awareness of environment perception for autonomous 
driving. In Chapter “Invertible Neural Networks for Understanding Semantics 
of Invariances of CNN Representations” (p. 197), invertible neural networks are 
deployed to recover invariant representations present in a trained network. They 
allow for a mapping to semantic concepts and thereby interpretation (and manipu- 
lation) of inner model representations. Confidence calibration, on the other hand, is 
an important means to obtain reliable confidences for further processing. Chapter 
“Confidence Calibration for Object Detection and Segmentation” (p. 225) proposes 
a multivariate calibration approach not only suitable for classification but also for 
semantic segmentation and object detection. Also aiming at improved object detec- 
tion calibration, the authors of Chapter “Uncertainty Quantification for Object Detec- 
tion: Output- and Gradient-Based Approaches” (p. 251) show that output-based and 
learning-gradient-based uncertainty metrics are uncorrelated. This allows advanta- 
geous combination of both paradigms leading to better overall detection accuracy 
and to better detection uncertainty estimates. 
Part II started with safety aspects related to datasets (Chapter “Does Redundancy
in AI Perception Systems Help to Test for Super-Human Automated Driving Perfor- 
mance?”, p. 81). In this fourth focus area, further safety-related aspects are taken 
up, including validation of networks. To validate environment perception functions, 
Chapter “A Variational Deep Synthesis Approach for Perception Validation” (p. 359) 
introduces a novel data generation framework for DNN validation purposes. The 
framework allows for effectively defining and testing common traffic scenes. The 
authors of Chapter “The Good and the Bad: Using Neuron Coverage as a DNN Vali- 
dation Technique” (p. 383) present a useful metric to assess the training regimen 
and final quality of a DNN. Different forms of neuron coverage are discussed 
along with their roots in traditional verification and validation (V&V) methods. 
Finally, in Chapter “Safety Assurance of Machine Learning for Perception Func- 
tions” (p. 335), an assurance case is constructed and acceptance criteria for DNN 
models are proposed. Various use cases are covered. 


Tim Fingscheidt 
Hanno Gottschalk 
Sebastian Houben 


Abstract Deployment of modern data-driven machine learning methods, most often
realized by deep neural networks (DNNs), in safety-critical applications such as 
health care, industrial plant control, or autonomous driving is highly challenging 
due to numerous model-inherent shortcomings. These shortcomings are diverse and 
range from a lack of generalization over insufficient interpretability and implausible 
predictions to directed attacks by means of malicious inputs. Cyber-physical systems 
employing DNNs are therefore likely to suffer from so-called safety concerns, prop- 
erties that preclude their deployment as no argument or experimental setup can help 
to assess the remaining risk. In recent years, an abundance of state-of-the-art tech- 
niques aiming to address these safety concerns has emerged. This chapter provides a 
structured and broad overview of them. We first identify categories of insufficiencies 
to then describe research activities aiming at their detection, quantification, or miti- 
gation. Our work addresses machine learning experts and safety engineers alike: The 
former ones might profit from the broad range of machine learning topics covered 
and discussions on limitations of recent methods. The latter ones might gain insights 
into the specifics of modern machine learning methods. We hope that this contribu- 
tion fuels discussions on desiderata for machine learning systems and strategies on 
how to help to advance existing approaches accordingly. 


1 Introduction


In barely a decade, deep neural networks (DNNs) have revolutionized the field of 
machine learning by reaching unprecedented, sometimes superhuman, performances 
on a growing variety of tasks. Many of these neural models have found their way into 
consumer applications like smart speakers, machine translation engines, or content 
feeds. However, in safety-critical systems, where human life might be at risk, the 
use of recent DNNs is challenging as various model-immanent insufficiencies are 
still difficult to address.

This work summarizes the promising lines of research in how to identify, address, 
and at least partly mitigate these DNN insufficiencies. While some of the reviewed 
works are theoretically grounded and foster the overall understanding of training 
and predictive power of DNNs, others provide practical tools to adapt their devel- 
opment, training, or predictions. We refer to any such method as a safety mech- 
anism if it addresses one or several safety concerns in a feasible manner. Their 
effectiveness in mitigating safety concerns is assessed by safety metrics [CNH+18, 
OOAG19, BGS+19, SS20a]. As most safety mechanisms target only a particular
insufficiency, we conclude that a holistic safety argumentation [BGS+19, SSH20,
SS20a, WSRA20] for complex DNN-based systems will in many cases rely on a 
variety of safety mechanisms. 

We structure our review of these mechanisms as follows: Sect. 2 focuses on dataset 
optimization for network training and evaluation. It is motivated by the well-known 
fact that, in comparison to humans, DNNs perform poorly on data that is structurally 
different from training data. Apart from insufficient generalization capabilities of 
these models, the data acquisition process and distributional data shifts over time 
play vital roles. We survey potential counter-measures, e.g., augmentation strategies 
and outlier detection techniques. 

Mechanisms that improve robustness are described in Sects. 3 and 4, respectively.
They deserve attention as DNNs are generally not resilient to common perturbations
and adversarial attacks.

Section 5 addresses incomprehensible network behavior and reviews mechanisms 
that aim at explainability, i.e., a more transparent functioning of DNNs. This is
particularly important from a safety perspective as interpretability might allow for
tracing back model failure cases, thus facilitating purposeful improvements.

Moreover, DNNs tend to overestimate their prediction confidence, especially on 
unseen data. Straightforward ways to estimate prediction confidence yield mostly 
unsatisfying results. Among others, this observation fueled research on more sophisticated
uncertainty estimations (see Sect. 6), redundancy mechanisms (see Sect. 7),
and attempts to reach formal verification as addressed in Sect. 8.

Lastly, many safety-critical applications require not only accurate but also near
real-time decisions. This is covered by mechanisms on the DNN architectural
level (see Sect. 9) and furthermore by compression and quantization methods (see
Sect. 10).

We conclude this review of mechanism categories with an outlook on the steps 
to transfer a carefully arranged combination of safety mechanisms into an actual 
holistic safety argumentation. 


2 Dataset Optimization 


The performance of a trained model inherently relies on the nature of the underlying 
dataset. For instance, a dataset with poor variability will hardly result in a model ready 
for real-world applications. In order to approach the latter, data selection processes, 
such as corner case selection and active learning, are of utmost importance. These 
approaches can help to design datasets that contain the most important information, 
while preventing the so-much-desired information from getting lost in an ocean of 
data. For a given dataset and active learning setup, data augmentation techniques are
commonly used to extract as much model performance from the dataset as
possible.

On the other hand, safety arguments also require the analysis of how a model 
behaves on out-of-distribution data, i.e., data that contains concepts the model has not
encountered during training. This is quite likely to happen as our world is under 
constant change, in other words, exposed to a constantly growing domain shift. 
Therefore, these fields are lately gaining interest, also with respect to perception in 
automated driving. 

In this section, we provide an overview of anomaly and outlier detection, active
learning, domain shift, augmentation, and corner case detection. The highly relevant 
problem of obtaining statistically sufficient test data, even under the assumption 
of redundant and independent systems, will then be discussed in Chapter “Does 
Redundancy in AI Perception Systems Help to Test for Super-Human Automated 
Driving Performance?” [GRS22]. There it will be shown that neural networks trained 
on the same computer vision task show high correlation in error cases, even if training 
data and other design choices are kept independent. Using different sensor modalities, 
however, diminishes the problem to some extent. 
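
To illustrate why correlated errors limit the benefit of redundancy, the following minimal sketch computes the probability that two redundant systems fail at the same time. It assumes, purely for illustration, that both systems have the same failure probability p and that their binary failure indicators have Pearson correlation rho; the function name and the numbers are hypothetical and are not taken from the referenced chapter.

```python
def joint_failure_probability(p: float, rho: float) -> float:
    """Probability that two redundant systems fail simultaneously.

    Assumption (illustrative): both systems fail with the same marginal
    probability p, and their failure indicators have Pearson correlation
    rho in [0, 1]. For two such Bernoulli variables,
    P(both fail) = p^2 + rho * p * (1 - p).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1] in this sketch")
    return p * p + rho * p * (1.0 - p)


if __name__ == "__main__":
    p = 1e-3  # hypothetical per-system failure probability
    for rho in (0.0, 0.1, 1.0):
        print(rho, joint_failure_probability(p, rho))
```

With p = 0.001, independent systems (rho = 0) fail jointly with probability 0.000001, whereas a correlation of only 0.1 already raises this to roughly 0.0001, i.e., by about two orders of magnitude.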


2.1 Outlier/Anomaly Detection


The terms anomaly, outlier, and out-of-distribution (OoD) data detection are often
used interchangeably in the literature and refer to the task of identifying data samples that
are not representative of the training data distribution. Uncertainty evaluation (cf. Sect. 6)
is closely tied to this field as self-evaluation of models is one of the active areas of 
research for OoD detection. In particular, for image classification problems it has 
been reported that neural networks often produce high confidence predictions on OoD 
data [NYC15, HG17]. The detection of such OoD inputs can either be tackled by 
post-processing techniques that adjust the estimated confidence [LLS18, DT18] or by
enforcing low confidence on OoD samples during training [HAB19, HMD19]. Even 
guarantees that neural networks produce low confidence predictions for OoD samples 
can be provided under specific assumptions (cf. [MH20b]). More precisely, this work 
utilizes Gaussian mixture models that, however, may suffer from high-dimensional 
data and require strong assumptions on the distribution parameters. Some approaches 
use generative models like GANs [SSW+17, AAAB18] and autoencoders [ZP17] 
for outlier detection. The models are trained to learn in-distribution data manifolds 
and will produce higher reconstruction loss for outliers. 
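
As a point of reference for the confidence-based approaches above, the following minimal sketch flags an input as OoD when the maximum softmax probability falls below a threshold. The temperature parameter is included only as one simple example of a post-hoc confidence adjustment; the function names, the threshold, and the temperature are illustrative assumptions and do not reproduce any of the cited methods.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits.

    temperature > 1 flattens the distribution (a simple post-hoc
    adjustment, assumed here for illustration only).
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def is_out_of_distribution(logits, threshold=0.7, temperature=1.0):
    """Flag a sample as OoD if its maximum softmax probability is low.

    The threshold is a hypothetical value; in practice it would be
    tuned on held-out in-distribution and OoD data.
    """
    confidence = max(softmax(logits, temperature))
    return confidence < threshold


if __name__ == "__main__":
    print(is_out_of_distribution([10.0, 0.0, 0.0]))   # confident prediction
    print(is_out_of_distribution([0.1, 0.0, 0.05]))   # nearly uniform prediction
```

Since networks may also assign high confidence to OoD inputs, as noted above, such a threshold on the raw softmax output should be seen as a baseline rather than a reliable detector.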

For OoD detection in semantic segmentation, only a few works have been pre- 
sented so far. Angus et al. [ACS19] present a comparative study of common OoD 
detection methods, which mostly deal with image-level classification. In addition, 
they provide a novel setup of relevant OoD datasets for this task. Another work trains 
a fully convolutional binary classifier that distinguishes image patches from a known 
set of classes from image patches stemming from an unknown class [BKOS18]. The
classifier output applied at every pixel will give the per-pixel confidence value for 
an OoD object. Both of these works perform at pixel level and without any sophisti- 
cated feature generation methods specifically tailored for the detection of entire OoD 
instances. Up to now, outlier detection has not been studied extensively for object 
detection tasks. In [GBA+19], two CNNs are used to perform object detection and 
binary classification (benign or anomaly) in a sequential fashion, where the second 
CNN takes the localized object within the image as input. From a safety standpoint, 
detecting outliers or OoD samples is extremely important and beneficial as training 
data cannot realistically be large enough to capture all situations. Research in this area 
is heavily entwined with progress in uncertainty estimation (cf. Sect. 6) and domain
adaptation (cf. Sect. 2.3). Extending research works to segmentation and object detection
tasks would be particularly significant for advancing automated driving research.
In addition to safety, OoD detection can be beneficial in other aspects, e.g., when 
using local expert models. Here, an expert model for segmentation of urban driving 
scenes and another expert model for segmentation of highway driving scenes can 
be deployed in parallel, where an additional OoD detector could act as a trigger for
switching between the models. We extend this discussion in Chapter “Detecting and
Learning the Unknown in Semantic Segmentation” [CURG22], where we investigate
the handling of unknown objects in semantic segmentation. In the scope of semantic
segmentation, we detect anomalous objects via high-entropy responses and perform
a statistical analysis over these detections to suggest new semantic categories.

Inspect, Understand, Overcome: A Survey of Practical Methods for AI Safety

With respect to the approaches presented above, uncertainty-based and generative- 
model-based OoD detection methods are currently promising directions of research. 
However, it remains an open question whether they can unfold their potential well 
on segmentation and object detection tasks. 
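
The idea of detecting anomalous regions via high-entropy responses in semantic segmentation can be sketched as follows. This is a minimal illustration under stated assumptions: the network output is given as a nested list of per-pixel class probabilities, and a pixel is marked as anomalous if its entropy exceeds a fraction of the maximum possible entropy. The function names and the relative threshold are hypothetical and not taken from the referenced chapter.

```python
import math


def pixel_entropy(probs):
    """Shannon entropy (in nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_map(softmax_output):
    """Per-pixel entropy; softmax_output[row][col] is a list of class probabilities."""
    return [[pixel_entropy(px) for px in row] for row in softmax_output]


def anomaly_mask(softmax_output, rel_threshold=0.7):
    """Mark pixels whose entropy exceeds rel_threshold * log(num_classes).

    rel_threshold is an assumed, illustrative value.
    """
    num_classes = len(softmax_output[0][0])
    limit = rel_threshold * math.log(num_classes)
    return [[h > limit for h in row] for row in entropy_map(softmax_output)]


if __name__ == "__main__":
    # One image row with a confident pixel and a maximally uncertain pixel.
    output = [[[1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3]]]
    print(anomaly_mask(output))
```

Connected components of such a mask could then serve as candidate OoD objects for further statistical analysis.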


2.2 Active Learning 


It is widely known that, as a rule of thumb, for the training of any kind of artificial 
neural network, an increase of training data leads to increased performance. Obtain- 
ing labeled training data, however, is often very costly and time-consuming. Active 
learning provides one possible remedy to this problem: Instead of labeling every data
point, active learning utilizes a query strategy to request from a teacher/an oracle
those labels that improve the model performance most. The survey paper by Settles [Set10]
provides a broad overview regarding query strategies for active learning methods. 
However, except for uncertainty sampling and query by committee, most of them 
seem to be infeasible in deep learning applications up to now. Hence, most of the 
research activities in active deep learning focus on these two query strategies, as we 
outline in the following. 

It has been shown for image classification [GIG17, RKG18] that labels corresponding
to uncertain samples can improve the network’s performance significantly
and that a combination with semi-supervised learning is promising. In both works,
uncertainty of unlabeled samples is estimated via Monte Carlo (MC) dropout inference.
MC dropout inference and a chosen number of training epochs are executed
alternatingly; after performing MC dropout inference, the unlabeled samples’ uncertainties
are assessed by means of sample-wise dispersion measures. Samples for
which the DNN model is very uncertain about its prediction are presented to an
oracle and labeled.
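
The query step of such an uncertainty sampling loop can be sketched as follows. The sketch assumes that the class probabilities of several stochastic (MC dropout) forward passes are already available for each unlabeled sample, and it uses the entropy of the mean prediction as one possible dispersion measure. The function names and the choice of measure are illustrative assumptions, not the exact procedure of the cited works.

```python
import math


def predictive_entropy(mc_probs):
    """Entropy of the mean class distribution over T stochastic passes.

    mc_probs: list of T lists, each holding the class probabilities of
    one MC dropout forward pass for the same sample.
    """
    num_passes = len(mc_probs)
    num_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / num_passes for c in range(num_classes)]
    return -sum(m * math.log(m) for m in mean if m > 0.0)


def select_queries(pool_mc_probs, budget):
    """Return the indices of the `budget` most uncertain unlabeled samples."""
    scores = [predictive_entropy(sample) for sample in pool_mc_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    # Three unlabeled samples, two MC dropout passes each (hypothetical values).
    pool = [
        [[0.90, 0.10], [0.95, 0.05]],  # consistently confident
        [[0.90, 0.10], [0.10, 0.90]],  # passes disagree strongly
        [[0.60, 0.40], [0.70, 0.30]],  # moderately uncertain
    ]
    print(select_queries(pool, budget=2))
```

The selected samples would then be labeled by the oracle and added to the training set before the next round of training epochs.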

With respect to object detection, a moderate number of active learning meth- 
ods has been introduced [KLSL18, RUN18, DCG+19, BKD19]. These approaches 
include uncertainty sampling [KLSL18, BKD19] and query-by-committee methods
[RUN18]. In [KLSL18, DCG+19], additional algorithmic features specifically
tailored for object detection networks are presented, i.e., separate treatment of the
localization and classification loss [KLSL18], as well as weak and strong supervision
schemes [DCG+19]. For semantic segmentation, an uncertainty-sampling-based
approach has been presented [MLG+18], which queries polygon masks for
image sections of a fixed size (128 × 128). Queries are performed by means of accumulated
entropy in combination with a cost estimation for each candidate image
section. Recently, new methods for estimating the quality of a prediction [DT18, 
RCH+20] as well as new uncertainty quantification approaches, e.g., gradient-based 
ones [ORG18], have been proposed. It remains an open question whether they are 
suitable for active learning. Since most of the conducted studies are of a rather academic
nature, their applicability to real-life data acquisition has also not yet been demonstrated
sufficiently. In particular, it is not clear whether the proposed active learning schemes,
including the label acquisition, for instance, in semantic segmentation, are suitable to
be performed by human labelers. Therefore, label acquisition with a common
understanding of the labelers’ convenience and suitability for active learning is a
promising direction for research and development.
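
The region-wise querying for semantic segmentation described above, which combines accumulated entropy with a cost estimate per image section, can be sketched as follows. How the two quantities are combined is an assumption made here for illustration (the ratio of accumulated entropy to accumulated cost); the per-pixel cost map and all names are hypothetical and do not reproduce the cited approach.

```python
def rank_sections(entropy_map, cost_map, size):
    """Rank non-overlapping size x size image sections for querying.

    entropy_map, cost_map: 2D lists of equal shape holding per-pixel
    entropy and a (hypothetical) per-pixel labeling cost estimate.
    Returns a list of (score, top, left), sorted by descending score,
    where score = accumulated entropy / accumulated cost (assumed rule).
    """
    height, width = len(entropy_map), len(entropy_map[0])
    ranked = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            cells = [(r, c) for r in range(top, top + size)
                     for c in range(left, left + size)]
            entropy = sum(entropy_map[r][c] for r, c in cells)
            cost = sum(cost_map[r][c] for r, c in cells)
            score = entropy / cost if cost > 0 else 0.0
            ranked.append((score, top, left))
    return sorted(ranked, reverse=True)


if __name__ == "__main__":
    # Toy 2 x 4 image split into two 2 x 2 sections (real sections are larger).
    entropy = [[1.0, 1.0, 0.5, 0.5],
               [1.0, 1.0, 0.5, 0.5]]
    cost = [[4.0, 4.0, 1.0, 1.0],
            [4.0, 4.0, 1.0, 1.0]]
    print(rank_sections(entropy, cost, size=2))
```

In this toy example, the right section is ranked first although its accumulated entropy is lower, because its estimated labeling cost is much smaller.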


2.3 Domains 


The classical assumption in machine learning is that the training and testing datasets 
are drawn from the same distribution, implying that the model is deployed under the 
same conditions as it was trained under. However, as mentioned in [MTRA+12,
JDCR12], for real-world applications this assumption is often violated in the sense
that the training and the testing sets stem from different domains having different
distributions. This poses difficulties for statistical models, and their performance will
mostly degrade when they are deployed on a domain $\mathcal{D}^{\mathrm{test}}$ having a
different distribution than the training dataset (i.e., generalizing from the training to
the testing domain is not possible). This makes the study of domains not only relevant from the
machine learning perspective, but also from a safety point of view. 

More formally, there are differing notions of a “domain” in the literature. For [Csu17,
MD18], a domain $\mathcal{D} = \{\mathcal{X}, P(x)\}$ consists of a feature space
$\mathcal{X} \subset \mathbb{R}^{d}$ together with a marginal probability distribution
$P(x)$ with $x \in \mathcal{X}$. In [BCK+07, BDBC+10], a
domain is a pair consisting of a distribution over the inputs together with a labeling
function. However, instead of a sharp labeling function, it is also widely accepted
to define a (training) domain $\mathcal{D}^{\mathrm{train}} = \{(x_i, y_i)\}_{i=1}^{n}$ to
consist of $n$ (labeled) samples that are sampled from a joint distribution $P(x, y)$
(cf. [LCWJ18]).

The reasons for distributional shift are diverse—as are the names to indicate a 
shift. For example, if the rate of (class) images of interest differs between the training
and the testing set, this can lead to a domain gap and result in differing overall error rates.
Moreover, as Chen et al. [CLS+18] mention, changing weather conditions and cam- 
era setups in cars lead to a domain mismatch in applications of autonomous driving. 
In biomedical image analysis, different imaging protocols and diverse anatomical 
structures can hinder generalization of trained models (cf. [KBL+17, DCO+19]). 
Common terms to indicate distributional shift are domain shift, dataset shift, covari- 
ate shift, concept drift, domain divergence, data fracture, changing environments, 
or dataset bias. References [Sto08, MTRA+12] provide an overview. Methods and 
measures to overcome the problem of domain mismatch between one or more (cf. 
[ZZW+18]) source domains and target domain(s) and the resulting poor model per- 
formance are studied in the field of transfer learning and, in particular, its subtopic 
domain adaptation (cf. [MD18]). For instance, adapting a model that is trained on 
synthetically generated data to work on real data is one of the core challenges, as 
can be seen in [CLS+18, LZG+19, VJB+19]. Furthermore, detecting when samples
are out-of-domain or out-of-distribution is an active field of research (cf. [LLLS18]
and Sect. 2.1 as well as Sect. 8.2 for further reference). This is particularly relevant
for machine learning models that operate in the real world: if an automated vehicle 
encounters some situation that deviates strongly from what was seen during training 
(e.g., due to some special event like a biking competition, carnival, etc.) this can lead 
to wrong predictions and thereby potential safety issues if not detected in time. 
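
Returning to the first example of this paragraph, the effect of a changed class rate on the overall error rate can be illustrated with a small calculation. The class names, class frequencies, and per-class error rates below are made-up numbers chosen for illustration only. The sketch assumes that the per-class error rates of the classifier stay fixed and that only the class frequencies differ between the training and the testing set.

```python
def overall_error(class_priors, class_error_rates):
    """Overall error rate as the prior-weighted sum of per-class error rates."""
    return sum(class_priors[c] * class_error_rates[c] for c in class_priors)

# Hypothetical per-class error rates of a fixed classifier.
error_rates = {"pedestrian": 0.20, "background": 0.02}

# Class frequencies differ between the training and the testing set.
train_priors = {"pedestrian": 0.10, "background": 0.90}
test_priors = {"pedestrian": 0.40, "background": 0.60}

train_error = overall_error(train_priors, error_rates)  # 0.1*0.2 + 0.9*0.02
test_error = overall_error(test_priors, error_rates)    # 0.4*0.2 + 0.6*0.02
```

Although the classifier itself is unchanged, the overall error rate in this toy setting rises from 3.8% to 9.2% purely because the harder class occurs more often at test time.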

In Chapter “Analysis and Comparison of Datasets by Leveraging Data Distri- 
butions in Latent Spaces” [SRL+22], a new technique to automatically assess the 
discrepancy of the domains of different datasets is proposed, including domain 
shift within one target application. The aptitude of encodings generated by different 
machine-learned models on a variety of automotive datasets is considered. In par- 
ticular, loss variants of the variational autoencoder that enforce disentangled latent 
space representations yield promising results in this respect. 


2.4 Augmentation 


Given the need for large amounts of data to train neural networks, one often runs into
a situation where data is lacking. This can lead to insufficient generalization and
overfitting to the training data. An overview of different techniques to tackle this
challenge can be found in [KGC17]. One approach to overcome this issue
is the augmentation of data. It aims at optimizing available data and increasing its
amount, curating a dataset that represents a wide variety of possible inputs during
deployment. Augmentation can also help when working with a
heavily unbalanced dataset by creating more samples of underrepresented classes. A
broad survey on data augmentation is provided by [SK19]. The authors distinguish between
two general approaches to data augmentation. The first are data warping
augmentations, which take existing data and transform it in a way that
does not affect labels. The second are oversampling augmentations, which
create synthetic data that can be used to increase the size of the dataset.

Examples of some of the most basic augmentations are flipping, cropping, rotating,
translating, shearing, and zooming. These affect the geometric properties
of the image and are easily implemented [SK19]. The machine learning toolkit
Keras, for example, provides an easy way of applying them to data using its
ImageDataGenerator class [C+15]. Other simple methods include adaptations
in color space that affect properties such as lighting, contrast, and tints, which are
common variations within image data. Filters can be used to control increased blur or
sharpness [SK19]. In [ZZK+20], random erasing is introduced as a method with an
effect similar to cropping, aiming at gaining robustness against occlusions. An example
of mixing images together as an augmentation technique can be found in [Ino18].
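
A few of the basic operations named above can be sketched without any framework. The following minimal example works on a grayscale image stored as a list of rows; the function names are hypothetical, and `mix` is a plain pixel-wise average that is not claimed to be the exact procedure of [Ino18], just as `random_erase` only illustrates the idea behind [ZZK+20].

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2D image given as a list of rows."""
    return [list(reversed(row)) for row in image]

def crop(image, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def random_erase(image, height, width, fill=0, rng=random):
    """Overwrite a randomly placed height x width rectangle with `fill`."""
    rows, cols = len(image), len(image[0])
    top = rng.randint(0, rows - height)
    left = rng.randint(0, cols - width)
    erased = [list(row) for row in image]  # copy, keep the input untouched
    for r in range(top, top + height):
        for c in range(left, left + width):
            erased[r][c] = fill
    return erased

def mix(image_a, image_b, weight=0.5):
    """Pixel-wise weighted average of two images of equal size."""
    return [[weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
flipped = horizontal_flip(image)
patch = crop(image, top=1, left=0, height=2, width=2)
erased = random_erase(image, height=2, width=2, rng=random.Random(0))
mixed = mix(image, flipped)
```

For label-preserving use, the corresponding annotations (e.g., bounding boxes or segmentation masks) would have to be transformed consistently, which is omitted here.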

The abovementioned methods have in common that they work directly on the input data,
but there are other approaches that make use of deep learning for augmentation.
An example of making augmentations in feature space using autoencoders can be
found in [DT17]. They use the representation generated by the encoder and generate
new samples by interpolation and extrapolation between existing samples of a
class. The lack of interpretability of augmentations in feature space in combination
with the tendency to perform worse than augmentations in image space presents
open challenges for those types of augmentations [WGSM17, SK19]. Adversarial
training is another method that can be used for augmentation. The goal of adversarial
training is to discover cases that would lead to wrong predictions. That means
the augmented images will not necessarily represent samples that could occur during
deployment, but they can help in achieving more robust decision boundaries [SK19].
An example of such an approach can be found in [LCPB18]. Generative modeling
can be used to generate synthetic samples that enlarge the dataset in a useful way.
GANs, variational autoencoders, and the combination of both are important
tools in this area [SK19]. Examples of data augmentation in a medical context using a
CycleGAN [ZPIE17] can be found in [SYPS19] and using a progressively growing
GAN [KALL18] in [BCG+18]. Besides neural style transfer [GEB15], which can be
used to change the style of an image to a target style, AutoAugment [CZM+19] and
population-based augmentation [HLS+19] are two further noteworthy approaches. In
both, the idea is to search a predefined search space of augmentations for the
best selection.
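
The interpolation and extrapolation in feature space mentioned at the beginning of this paragraph can be sketched as follows. The sketch assumes that feature vectors produced by an encoder are given as plain lists; the parameter `lam` and the function names are hypothetical, and the exact formulation used in [DT17] may differ.

```python
def interpolate(z_a, z_b, lam=0.5):
    """Move z_a towards z_b: z_a + lam * (z_b - z_a)."""
    return [a + lam * (b - a) for a, b in zip(z_a, z_b)]

def extrapolate(z_a, z_b, lam=0.5):
    """Move z_a away from z_b: z_a + lam * (z_a - z_b)."""
    return [a + lam * (a - b) for a, b in zip(z_a, z_b)]

# Hypothetical encoder outputs of two samples of the same class.
z_a = [1.0, 2.0]
z_b = [3.0, 6.0]
z_inter = interpolate(z_a, z_b)   # lies between the two samples
z_extra = extrapolate(z_a, z_b)   # lies beyond z_a, away from z_b
```

The new vectors would then be used as additional training samples in feature space or passed through a decoder, which is not shown here.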

The field of augmenting datasets with purely synthetic images, including related 
work, is addressed in Chapter “Optimized Data Synthesis for DNN Training and 
Validation by Sensor Artifact Simulation” [HG22], where a novel approach to apply 
realistic sensor artifacts to given synthetic data is proposed. The better overall quality 
is demonstrated via established per-image metrics and a domain distance measure 
comparing entire datasets. Exploiting this metric as optimization criterion leads to 
an increase in performance for the DeeplabV3+ model as demonstrated on the 
Cityscapes dataset. 


2.5 Corner Case Detection 


Ensuring that AI-based applications behave correctly and predictably even in unexpected
or rare situations is a major concern that gains importance especially in
safety-critical applications such as autonomous driving. In the pursuit of more robust AI,
corner cases play an important role. The meaning of the term corner case varies in
the literature. Some consider mere erroneous or incorrect behavior as corner cases
[PCYJ19, TPJR18, ZHML20]. For example, in [BBLFs19] corner cases are referred
to as situations in which an object detector fails to detect relevant objects at relevant
locations. Others characterize corner cases mainly as rare combinations of input
parameter values [KKB19, HDHH20]. This project adopts the first definition: inputs
that result in unexpected or incorrect behavior of the AI function are defined as 
corner cases. Contingent on the hardware, the AI architecture and the training data, 
the search space of corner cases quickly becomes incomprehensibly large. While 
manual creation of corner cases (e.g., constructing or re-enacting scenarios) might 
be more controllable, approaches that scale better and allow for a broader and more 
systematic search for corner cases require extensive automation.

One approach to automatic corner case detection is based on transforming the input
data. The DeepTest framework [TPJR18] uses three types of image transformations: 
linear, affine, and convolutional transformations. In addition to these transformations, 
metamorphic relations help detect undesirable behaviors of deep learning systems. 
They allow changing the input while asserting some characteristics of the result 
[XHM+11]. For example, changing the contrast of input frames should not affect the 
steering angle of a car [TPJR18]. Input-output pairs that violate those metamorphic
relations can be considered as corner cases. 
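
A metamorphic relation of this kind can be checked with a few lines of code. The following sketch is not the DeepTest implementation: the "steering model" is a hypothetical and deliberately fragile toy function, the contrast change is one simple possible definition, and the tolerance is an arbitrary assumption.

```python
def change_contrast(frame, factor):
    """Scale pixel deviations from mid-gray (0.5) and clip to [0, 1]."""
    return [[min(1.0, max(0.0, 0.5 + factor * (p - 0.5))) for p in row]
            for row in frame]

def violates_relation(model, frame, transform, tolerance):
    """True if the prediction changes by more than `tolerance` under `transform`."""
    return abs(model(frame) - model(transform(frame))) > tolerance

def brightness_model(frame):
    """Hypothetical toy 'steering' model: steers by the brightness
    difference between the right and the left image half."""
    half = len(frame[0]) // 2
    left = sum(sum(row[:half]) for row in frame)
    right = sum(sum(row[half:]) for row in frame)
    return right - left

frame = [[0.2, 0.2, 0.8, 0.8],
         [0.2, 0.2, 0.8, 0.8]]
is_corner_case = violates_relation(
    brightness_model, frame, lambda f: change_contrast(f, 0.5), tolerance=0.1)
```

Here the toy model's output changes considerably when the contrast is halved, so the input-output pair violates the relation and would be flagged as a corner case.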

Among other things, the white-box testing framework DeepXplore [PCYJ19] 
applies a method called gradient ascent to find corner cases (cf. Sect. 8.1). In the 
experimental evaluation of the framework, three variants of deep learning architec- 
tures were used to classify the same input image. The input image was then changed 
according to the gradient ascent of an objective function that reflected the difference 
in the resulting class probabilities of the three model variants. When the changed (now 
artificial) input resulted in different class label predictions by the model variants, the 
input was considered as a corner case. 
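
The idea of changing an input by gradient ascent until model variants disagree can be illustrated in a strongly simplified form. The sketch below is not the DeepXplore algorithm: it uses only two hypothetical logistic models instead of three deep networks, it maximizes the plain difference of their class-1 probabilities, and it omits any further objective terms and input constraints.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob(weights, x):
    """Class-1 probability of a logistic model without bias."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)))

def find_disagreement(w_a, w_b, x, step=0.5, max_steps=100):
    """Gradient ascent on prob(w_a, x) - prob(w_b, x) until the labels differ."""
    x = list(x)
    for _ in range(max_steps):
        p_a, p_b = prob(w_a, x), prob(w_b, x)
        if (p_a > 0.5) != (p_b > 0.5):
            return x  # the two models now predict different classes
        # d/dx sigmoid(w.x) = p * (1 - p) * w
        grad = [p_a * (1 - p_a) * wa - p_b * (1 - p_b) * wb
                for wa, wb in zip(w_a, w_b)]
        x = [v + step * g for v, g in zip(x, grad)]
    return None  # no disagreement found within the step budget

w_a, w_b = [1.0, 1.0], [1.0, -1.0]
seed_input = [1.0, 0.0]  # both models agree on class 1 here
corner_case = find_disagreement(w_a, w_b, seed_input)
```

Starting from an input on which both toy models agree, the ascent moves the input into a region where the first model still predicts class 1 while the second one predicts class 0.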

In [BBLFs19], corner cases are detected on video sequences by comparing pre- 
dicted frames with actual frames. The detector has three components: the first com- 
ponent, semantic segmentation, is used to detect and locate objects in the input frame. 
As the second component, an image predictor trained on frame sequences predicts 
the actual frame based on the sequence preceding that frame. An error is determined 
by comparing the actual with the predicted (i.e., expected) frame, following the idea 
that only situations that are unexpected for AI-based perception functions may be 
potentially dangerous and therefore a corner case. Both the segmentation and the pre- 
diction error are then fed into the third component of the detector, which determines 
a corner case score that reflects the extent to which unexpected relevant objects are 
at relevant locations. 

In [HDHH20], a corner case detector based on simulations in a CARLA environ- 
ment [DRC+17] is presented. In the simulated world, AI agents control the vehicles. 
During simulations, state information of both the environment and the AI agents is
fed into the corner case detector. While the environment provides the real vehicle 
states, the AI agents provide estimated and perceived state information. Both sources 
are then compared to detect conflicts (e.g., collisions). These conflicts are recorded 
for analysis. Several ways of automatically generating and detecting corner cases 
exist. However, corner case detection is a task with challenges of its own: depending 
on the operational design domain including its boundaries, the space of possible 
inputs can be very large. Also, some types of corner cases are specific to the AI 
architecture, e.g., the network type or the network layout used. Thus, corner case 
detection has to assume a holistic point of view on both model and input, adding 
further complexity and reducing transferability of previous insights. 

Although it can be argued that rarity does not necessarily characterize corner 
cases, rare input data might have the potential of challenging the AI functionality 
(cf. Sect. 2.1). Another research direction could investigate whether structuring the
input space in a way suitable for the AI functionality supports the detection of corner
cases. Provided that the operational design domain is conceptualized as an ontology,
ontology-based testing [BMM18] may support automatic detection. A properly
adapted generator may specifically select promising combinations of extreme param- 
eter values and, thus, provide valuable input for synthetic test data generation. 


3 Robust Training 


Recent works [FF15, RSFD16, BRW18, AW19, HD19, ETTS19, BHSFs19] have 
shown that state-of-the-art deep neural networks (DNNs) performing a wide variety 
of computer vision tasks, such as image classification [KSH12, HZRS15, MGR+18], 
object detection [Gir15, RDGF15, HGDG17], and semantic segmentation [CPSA17,
ZSR+19, WSC+20, LBS+19], are not robust to small changes in the input. 

Robustness of neural networks is an active and open research field that can be 
considered highly relevant for achieving safety in automated driving. Currently, most 
of the research is directed toward either improving adversarial robustness [SZS+14] 
(robustness against carefully designed perturbations that aim at causing misclassifications
with high confidence) or improving corruption robustness [HD19] (robustness
against commonly occurring augmentations such as weather changes, addition of 
Gaussian noise, photometric changes, etc.). While adversarial robustness might be 
more of a security issue than a safety issue, corruption robustness, on the other hand, 
can be considered highly safety-relevant. 

Equipped with these definitions, we broadly term robust training here as methods 
or mechanisms that aim at improving either adversarial or corruption robustness 
of a DNN, by incorporating modifications into the architecture or into the training 
mechanism itself. 

In this section, we cover three widespread techniques for fostering robustness dur- 
ing model training: hyperparameter optimization, modification of loss, and domain 
generalization. Additionally, in Chapter “Improved DNN Robustness by Multi-Task 
Training With an Auxiliary Self-Supervised Task” [KFs22], an approach for robusti- 
fication via multi-task training is presented. Semantic segmentation is combined with 
the additional target of depth estimation and is proven to show increased robustness 
on the Cityscapes and KITTI datasets. 

A useful metric to assess the training regimen and final quality of a neural network 
is presented in Chapter “The Good and the Bad: Using Neuron Coverage as a DNN 
Validation Technique” [GAHW22]. The use of different forms of neuron coverage 
is discussed and juxtaposed with pair-wise coverage on a tractable example that is 
being developed.


3.1 Hyperparameter Optimization


The final performance of a neural network highly depends on the learning process. 
The process includes the actual optimization and may additionally introduce training 
methods such as dropout, regularization, or parametrization of a multi-task loss. 

These methods adapt their behavior according to predefined parameters, whose
optimal configuration is a priori unknown. We refer to them as hyperparameters.
Important hyperparameters comprise, for instance, the initial learning rate, steps for
learning-rate reduction, learning-rate decay, momentum, batch size, dropout rate,
and number of iterations. Their configuration has to be determined according to the
architecture and task of the CNN [FH19]. The search for an optimal hyperparameter
configuration is called hyperparameter optimization (HO).

HO is usually described as an optimization problem [FH19]. Thereby, the combined
configuration space is defined as
$\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_N$, according to each
domain $\Lambda_n$. Their individual spaces can be continuous, discrete, categorical, or binary.

Hence, one aims to find an optimal hyperparameter configuration $\lambda^{*}$ by minimizing
an objective function $\mathcal{O}(\cdot)$, which evaluates a trained model $F$ having parameters
$\theta$ on the validation dataset $\mathcal{D}^{\mathrm{val}}$ with the loss $J$:


$$\lambda^{*} = \underset{\lambda \in \Lambda}{\arg\min}\; \mathcal{O}\left(J, F, \theta, \mathcal{D}^{\mathrm{train}}, \mathcal{D}^{\mathrm{val}}\right). \tag{1}$$
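
The optimization problem in Eq. (1) can be made concrete with the simplest possible search strategy, a random search over the configuration space. This is neither Bayesian optimization nor any of the methods cited in this section. The configuration space, the function names, and in particular `toy_objective`, a closed-form stand-in for training a model and evaluating it on the validation dataset, are hypothetical.

```python
import math
import random

# Configuration space: one continuous and two categorical/discrete domains.
SPACE = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "batch_size": ("choice", [16, 32, 64, 128]),
    "optimizer": ("choice", ["sgd", "adam"]),
}

def sample_configuration(space, rng):
    """Draw one configuration, i.e., one value from every domain."""
    config = {}
    for name, spec in space.items():
        if spec[0] == "log_uniform":
            low, high = math.log10(spec[1]), math.log10(spec[2])
            config[name] = 10 ** rng.uniform(low, high)
        else:
            config[name] = rng.choice(spec[1])
    return config

def toy_objective(config):
    """Hypothetical stand-in for 'train on D_train, return the loss on D_val'."""
    loss = (math.log10(config["learning_rate"]) + 3.0) ** 2  # best near 1e-3
    loss += abs(config["batch_size"] - 64) / 64.0            # best at 64
    loss += 0.0 if config["optimizer"] == "adam" else 0.5
    return loss

def random_search(space, objective, num_trials, seed=0):
    """Return the best configuration found and its objective value."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(num_trials):
        config = sample_configuration(space, rng)
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

best_config, best_loss = random_search(SPACE, toy_objective, num_trials=200)
```

In practice, every call of the objective corresponds to a full (or partial) training run, which is why more sample-efficient strategies than random search are of interest.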


This problem statement is widely considered in traditional machine learning, where it is
primarily addressed by Bayesian optimization (BO) in combination with Gaussian
processes. However, a straightforward application to deep neural networks encounters
problems due to a lack of scalability, flexibility, and robustness [FKH18, ZCY+19]. 
To exploit the benefits of BO, many authors proposed different combinations with 
other approaches. Hyperband [LJD+18] in combination with BO (BOHB) [FKH18] 
frames the optimization as “[...] a pure exploration non-stochastic infinite-armed 
bandit problem [...]”. The method of BO for iterative learning (BOIL) [NSO20] iter- 
atively internalizes collected information about the learning curve and the learning 
algorithm itself. The authors of [WTPFW19] introduce the trace-aware knowledge 
gradient (taKG) as an acquisition function for BO (BO-taKG) which “leverages 
both trace information and multiple fidelity controls”. BOIL and BO-taKG thereby
achieve state-of-the-art performance on CNNs, outperforming Hyperband.

Other approaches such as the orthogonal array tuning method (OATM) [ZCY+19] 
or HO by reinforcement learning (Hyp-RL) [JGST19] turn away from the Bayesian 
approaches and offer new research directions. 

Finally, the insight that many authors include kernel sizes and number of kernels 
and layers in their hyperparameter configuration should be emphasized. More work 
should be spent on the distinct integration of HO in the performance estimation 
strategy of neural architecture search (cf. Sect. 9.3).


3.2 Modification of Loss


There exist many approaches that aim at directly modifying the loss function with an 
objective of improving either adversarial or corruption robustness [PDZ18, KS18, 
LYZZ19, XCKW19, SLC19, WSC+20, SR20]. One of the earliest approaches for 
improving corruption robustness was introduced by Zheng et al. [ZSLG16] and is
called stability training, where they introduce a regularization term that penalizes
differences between the network predictions for a clean and an augmented image.
However, their approach does not scale to many augmentations at the same time.
Janocha et al. [JC17] then presented a detailed analysis of the influence of multiple
loss functions on model performance as well as robustness and suggested that
expectation-based losses tend to work better with noisy data and squared-hinge losses
tend to work better for clean data. Other well-known approaches are mainly based on
variations of data augmentation [CZM+19, CZSL19, ZCG+19, LYP+19], which can be computationally quite
expensive. 
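
The structure of a stability-training-like objective, as described at the beginning of this paragraph, can be sketched as a task loss on the clean prediction plus a weighted penalty on the change of the prediction under augmentation. The choice of cross-entropy as task loss, of the Kullback-Leibler divergence as distance, and of the weight `alpha` are assumptions made for this illustration and may differ from the formulation in [ZSLG16].

```python
import math

def cross_entropy(probs, label):
    """Task loss on the clean prediction."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def stability_loss(clean_probs, perturbed_probs, label, alpha=0.1):
    """Task loss plus a penalty on the prediction change under perturbation."""
    return (cross_entropy(clean_probs, label)
            + alpha * kl_divergence(clean_probs, perturbed_probs))

clean = [0.7, 0.2, 0.1]
stable = [0.7, 0.2, 0.1]     # prediction unchanged by the augmentation
unstable = [0.3, 0.5, 0.2]   # prediction changed by the augmentation
loss_stable = stability_loss(clean, stable, label=0)
loss_unstable = stability_loss(clean, unstable, label=0)
```

If the prediction is unchanged, the penalty vanishes and only the task loss remains; the more the prediction changes under the augmentation, the larger the total loss becomes.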

In contrast to corruption robustness, there exist many more approaches based 
on adversarial examples. We highlight some of the most interesting and relevant 
ones here. Mustafa et al. [CWL+19] propose to add a loss term that maximally 
separates class-wise feature map representations, hence increasing the distance from 
data points to the corresponding decision boundaries. Similarly, Pang et al. [PXD+20] 
proposed the Max-Mahalanobis center (MMC) loss to learn more structured repre- 
sentations and induce high-density regions in the feature space. Chen et al. [CBLR18] 
proposed a variation of the well-known cross-entropy (CE) loss that not only max- 
imizes the model probabilities of the correct class, but, in addition, also minimizes 
model probabilities of incorrect classes. Cisse et al. [CBG+17] constrain the Lipschitz
constant of different layers to be less than one, which restricts the error propagation
introduced by adversarial perturbations to a DNN. Moosavi-Dezfooli et al. [MDFUF19]
proposed to minimize the curvature of the loss surface locally around data points. 
They emphasize that there exists a strong correlation between locally small curvature 
and correspondingly high adversarial robustness. 

All of the methods highlighted above are evaluated mostly for image classification
tasks on smaller datasets, namely, CIFAR-10 [Kri09], CIFAR-100 [Kri09],
SVHN [NWC+11], and only sometimes on ImageNet [KSH12]. Very few approaches 
have been tested rigorously on complex safety-relevant tasks, such as object detec- 
tion, semantic segmentation, etc. Moreover, methods that improve adversarial robust- 
ness are only tested on a small subset of attack types under differing attack specifi- 
cations. This makes comparing multiple methods difficult. 

In addition, methods that improve corruption robustness are evaluated over a
standard dataset of various corruption types, which may or may not be relevant to the
application domain. In order to assess multiple methods for their effect on
safety-related aspects, a thorough robustness evaluation methodology is needed, which is
largely missing in the current literature. This evaluation would need to take into
account relevant disturbances/corruption types present in the real world (application
domain) and would have to assess robustness toward such changes in a rigorous manner.
Without such an evaluation, we run the risk of being overconfident in our network,
thereby harming safety.


3.3 Domain Generalization 


Domain generalization (DG) can be seen as an extreme case of domain adaptation 
(DA). The latter is a type of transfer learning, where the source and target tasks are the
same (e.g., shared class labels) but the source and target domains are different (e.g.,
another image acquisition protocol or a different background) [Csu17, WYKN20].
DA can be either supervised (SDA), where there is little available labeled data
in the target domain, or unsupervised (UDA), where data in the target domain is not
labeled. DG goes one step further by assuming that the target domain is entirely
unknown. Thus, it seeks to solve the train-test domain shift in general. While DA is
already an established line of research in the machine learning community, DG is
relatively new [MBS13], though with an extensive list of papers in the last few years.

Probably, the first intuitive solution that one may think of to implement DG is neutralizing
the domain-specific features. It was shown in [WHLX19] that the gray-level
co-occurrence matrices (GLCM) tend to perform poorly in semantic classification 
(e.g., digit recognition) but yield good accuracy in textural classification compared 
to other feature sets, such as speeded up robust features (SURF) and local binary 
patterns (LBP). DG was thus implemented by decorrelating the model’s decision 
from the GLCM features of the input image even without the need of domain labels. 

Besides the aforementioned intensity-based statistics of an input image, it is 
known that characterizing image style can be done based on the correlations between 
the filter responses of a DNN layer [GEB16] (neural style transfer). In [SMK20], the
training images are enriched with stylized versions, where a style is defined either 
by an external style (e.g., cartoon or art) or by an image from another domain. Here, 
DG is addressed as a data augmentation problem. 

Some approaches [LTG+18, MH20a] try to learn generalizable latent represen- 
tations by a kind of adversarial training. This is done by a generator or an encoder, 
which is trained to generate a hidden feature space that maximizes the error of a 
domain discriminator but at the same time minimizes the classification error of the 
task of concern. Another variant of adversarial training can be seen in [LPWK18], 
where an adversarial autoencoder [MSJ+16] is trained to generate features, which a 
discriminator cannot distinguish from random samples drawn from a prior Laplace 
distribution. This regularization prevents the hidden space from overfitting to the 
source domains, in a similar spirit to how variational autoencoders do not leave gaps 
in the latent space. In [MH20a], it is argued that the domain labels needed in such 
approaches are not always well-defined or easily available. Therefore, they assume 
unknown latent domains which are learned by clustering in a space similar to the 
style-transfer features mentioned above. The pseudo-labels resulting from clustering 
are then used in the adversarial training.

Autoencoders have been employed for DG not only in an adversarial setup, but
also in the sense of multi-task learning nets [Car97], where the classification task 
in such nets is replaced by a reconstruction one. In [GKZB15], an autoencoder is 
trained to reconstruct not only the input image but also the corresponding images in 
the other domains. 

At the core of both DA and DG, we are confronted with a distribution matching
problem. However, estimating the probability density in high-dimensional spaces
is intractable. Density-based metrics, such as the Kullback-Leibler divergence, are
thus not directly applicable. In statistics, so-called two-sample tests are usually
employed to measure the distance between two distributions in a point-wise manner,
i.e., without density estimation. For deep learning applications, these metrics need to be
not only point-wise but also differentiable. The two-sample tests were approached in
the machine learning literature using (differentiable) K-NNs [DK17], classifier two- 
sample tests (C2ST) [LO17], or based on the theory of kernel methods [SGSS07]. 
More specifically, the maximum mean discrepancy (MMD) [GBR+06, GBR+12], 
which belongs to the kernel methods, is widely used for DA [GKZ14, LZWJ17, 
YDL+17, YLW+20] but also for DG [LPWK18]. Using the MMD, the distance 
between two samples is estimated based on pair-wise kernel evaluations, e.g., with the
radial basis function (RBF) kernel.
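
The estimation of the MMD from pair-wise kernel evaluations can be sketched as follows. The example computes the simple biased estimate of the squared MMD with an RBF kernel; the fixed bandwidth and the toy samples are assumptions made for illustration, and practical implementations typically operate on mini-batches of features inside an automatic differentiation framework.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

source = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
shifted = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]
mmd_same = mmd_squared(source, source)
mmd_shifted = mmd_squared(source, shifted)
```

The estimate is zero when both samples coincide and grows as the samples move apart. Since it consists only of sums and exponentials of differences, it is differentiable with respect to the sample points and can therefore serve as a loss term.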

While the DG approaches generalize to domains from which zero shots are avail- 
able, the so-called zero-shot learning (ZSL) approaches generalize to tasks (e.g., new 
classes in the same source domains) for which zero shots are available. Typically, the 
input in ZSL is mapped to a semantic vector per class instead of a simple class label. 
This can be, for instance, a vector of visual attributes [LNH14] or a word embedding 
of the class name [KXG17]. A task (with zero shots at training time) can then be
described by a vector in this space. In [MARC20], there is an attempt to combine 
ZSL and DG in the same framework in order to generalize to new domains as well 
as new tasks, which is also referred to as heterogeneous domain generalization. 

Note that most discussed approaches for DG require non-standard handling, i.e., 
modifications to models, data, and/or the optimization procedure. This issue poses 
a serious challenge as it limits the practical applicability of these approaches. There 
is a line of research which tries to address this point by linking DG to other machine 
learning paradigms, especially the model-agnostic meta-learning (MAML) [FAL17] 
algorithm, in an attempt to apply DG in a model-agnostic way. Loosely speaking, 
a model can be exposed to simulated train-test domain shift by training on a small 
support set to minimize the classification error on a small validation set. This can be 
seen as an instance of a few-shot learning (FSL) problem [WYKN20]. Moreover, the 
procedure can be repeated on other (but related) FSL tasks (e.g., different classes) 
in what is known as episodic training. The model transfers its knowledge from one 
task to another task and learns how to learn fast for new tasks. Thus, this can be seen 
as a meta-learning objective [HAMS20] (in an FSL setup). Since the goal of DG is
to adapt to new domains rather than new tasks, several model-agnostic approaches 
[LYSH18, BSC18, LZY+19, DdCKG19] try to recast this procedure in a DG setup.


4 Adversarial Attacks


Over the last few years, deep neural networks (DNNs) consistently showed state- 
of-the-art performance across several vision-related tasks. While their superior per- 
formance on clean data is indisputable, they show a lack of robustness to certain 
input patterns, denoted as adversarial examples [SZS+14]. In general, an algorithm 
for creating adversarial examples is referred to as an adversarial attack and aims at 
fooling an underlying DNN, such that the output changes in a desired and malicious 
way. This can be carried out without any knowledge about the DNN to be attacked 
(black-box attack) [MDFF16, PMG+17], or with full knowledge about the param- 
eters, architecture, or even training data of the respective DNN (white-box attack) 
[GSS15, CW17a, MMS+18]. While attacks were initially applied to simple classification
tasks, some approaches aim at finding more realistic attacks [TVRG19, JLS+20],
which particularly pose a threat to safety-critical applications, such as DNN-based 
environment perception systems in autonomous vehicles. Altogether, this motivated 
the research in finding ways of defending against such adversarial attacks [GSS15, 
GRCvdM18, MDFUF19, XZZ+19]. In this section, we introduce the current state of 
research regarding adversarial attacks in general, more realistic adversarial attacks 
closely related to the task of environment perception for autonomous driving, and 
strategies for detecting or defending against adversarial attacks. We conclude each section
by clarifying current challenges and research directions. 


4.1 Adversarial Attacks and Defenses 


The term adversarial example was first introduced by Szegedy et al. [SZS+14]. Since
then, many researchers have tried to find new ways of crafting adversarial examples
more effectively. Here, the fast gradient sign method (FGSM) [GSS15],
DeepFool [MDFF16], least-likely class method (LLCM) [KGB17a, KGB17b], C&W
[CW17b], momentum iterative fast gradient sign method (MI-FGSM) [DLP+18],
and projected gradient descent (PGD) [MMS+18] are a few of the most famous
attacks so far. In general, these attacks can be executed in an iterative fashion, where
the underlying adversarial perturbation is usually bounded by some norm and
follows additional optimization criteria, e.g., minimizing the number of changed
pixels.
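
As an illustration of a norm-bounded perturbation, the single-step FGSM can be sketched for a logistic regression model, for which the gradient of the cross-entropy loss with respect to the input is available in closed form. The model, its weights, and the value of `epsilon` are hypothetical; for DNNs, the gradient would be obtained by backpropagation instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def fgsm(weights, bias, x, label, epsilon):
    """One-step attack: x + epsilon * sign(gradient of the loss w.r.t. x).

    For the cross-entropy loss of a logistic model, the input gradient is
    (p - label) * weights.
    """
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [v + epsilon * s for v, s in zip(x, sign)]

weights, bias = [2.0, -1.0], 0.0
x, label = [0.3, 0.2], 1          # clean input, correctly classified as 1
x_adv = fgsm(weights, bias, x, label, epsilon=0.3)
```

Every input component is changed by at most `epsilon`, i.e., the perturbation is bounded in the maximum norm, yet in this toy setting the predicted class flips. Iterative attacks repeat such a step several times with a smaller step size and project the result back into the allowed norm ball.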

The mentioned attacks can be further categorized as image-specific attacks, where 
for each image a new perturbation needs to be computed. On the other hand, image- 
agnostic attacks aim at finding a perturbation, which is able to fool an underlying 
DNN on a set of images. Such a perturbation is also referred to as a universal adver- 
sarial perturbation (UAP). Here, the respective algorithm UAP [MDFFF17], fast 
feature fool (FFF) [MGB17], and prior driven uncertainty approximation (PD-UA) 
[LJL+19] are a few honorable mentions. Although the creation process of a universal
adversarial perturbation typically relies on a white-box setting, they show a high
transferability across models [HAMFs22]. This allows for black-box attacks, where
one model is used to create a universal adversarial perturbation, and another model 
is being attacked with the beforehand-created perturbation. Universal adversarial 
perturbations are investigated in more detail in Chapter “Improving Transferability 
of Generated Universal Adversarial Perturbations for Image Classification and Seg- 
mentation” [HAMFs22], where a way to construct these perturbations on the task of 
image classification and segmentation is presented. In detail, the low-level feature 
extraction layer is attacked via an additional loss term to generate universal adver- 
sarial perturbations more effectively. Another way of designing black-box attacks 
is to create a surrogate DNN, which mimics the respective DNN to be attacked 
and thus can be used in the process of adversarial example creation [PMG+17]. In
contrast, some research has been done to create completely incoherent images
(based on evolutionary algorithms or gradient ascent) to fool an underlying DNN
[NYC15]. Another line of work alters only some pixels in an image to attack the
respective model. Here, [NK17] used optimization approaches to perturb some
pixels in images to produce targeted attacks,
aiming at a specific class output, or non-targeted attacks, aiming at outputting a class 
different from the network output or the ground truth. This can be extended up to 
finding one pixel in the image to be exclusively perturbed to generate adversarial 
images [NK17, SVS19]. The authors of [BF17, SBMC17, PKGB18] proposed to 
train generative models to generate adversarial examples. Given an input image and 
the target label, a generative model is trained to produce adversarial examples for 
DNNs. However, while the produced adversarial examples look rather unrealistic to 
a human, they are able to completely deceive a DNN. 

The existence of adversarial examples not only motivated research in finding new 
attacks, but also in finding defense strategies to effectively defend against these attacks.
Especially for safety-critical applications, such as DNN-based environment perception
for autonomous driving, the existence of adversarial examples needs to be handled 
accordingly. Similar to adversarial attacks, one can categorize defense strategies into 
two types: model-specific defense strategies and model-agnostic defense strategies. 
The former refers to defense strategies, where the model of interest is modified in 
certain ways. The modification can be done on the architecture, training procedure, 
training data, or model weights. On the other hand, model-agnostic defense strategies 
consider the model to be a black box. Here, only the input or the output is accessi- 
ble. Some well-known model-specific defense strategies include adversarial training 
[GSS15, MMS+18], the inclusion of robustness-oriented loss functions during train- 
ing [CLC+19, MDFUF19, KYL+20], removing adversarial patterns in features by 
denoising layers [HRF19, MKH+19, XWvdM+19], and redundant teacher-student 
frameworks [BHSFs19, BKV+20]. The majority of model-agnostic defense strategies
focus on various kinds of (gradient masking) pre-processing strategies
[GRCvdM18, BFW+19, GR19, JWCF19, LLL+19, RSFM19, TCBZ19]. The idea 
is to remove the adversary from the respective image, such that the image is trans- 
formed from the adversarial space back into the clean space. 
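
To illustrate the idea of mapping an image back toward the clean space by pre-processing, the sketch below uses bit-depth reduction (quantization) as one simple input transformation. The function name `reduce_bit_depth`, the toy image, and the noise level are assumptions for illustration. The clean image is deliberately placed on the quantization grid, which is an idealization that real images do not satisfy.

```python
import numpy as np

def reduce_bit_depth(image, bits):
    """Quantize an image with values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits - 1
    return np.round(image * levels) / levels

rng = np.random.default_rng(0)

# Idealized clean image (assumed): values already lie on the 3-bit grid
clean = np.round(rng.uniform(0.0, 1.0, size=(8, 8)) * 7) / 7

# Small bounded perturbation standing in for an adversarial pattern
perturbation = rng.uniform(-0.03, 0.03, size=clean.shape)
adversarial = np.clip(clean + perturbation, 0.0, 1.0)

# The pre-processing step removes perturbations smaller than half a grid step
restored = reduce_bit_depth(adversarial, bits=3)
```

The sketch only shows the mechanism: perturbations smaller than half a quantization step are rounded away. It makes no claim about robustness against an attacker who is aware of the transformation.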

Nonetheless, Athalye et al. [ACW18] showed that gradient masking alone is not
a sufficient criterion for a reliable defense strategy. In addition, detection and
out-of-distribution techniques have also been proposed as model-agnostic defense strategies
against adversarial attacks. Here, the Mahalanobis distance [LLLS18] or area under 
the receiver operating characteristic curve (AUROC) and area under the precision- 
recall curve (AUPR) [HG17] are used to detect adversarial examples. The authors 
of [HG17, LLLS17, MCKBF17], on the other hand, proposed to train networks to 
detect whether the input image is out-of-distribution or not.
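
A Mahalanobis-distance detector of the kind referred to above can be sketched as follows. This is a simplified version that fits a single Gaussian to toy feature vectors; the feature data, the 99% quantile threshold, and the function names are assumptions for illustration, and the cited method operates on features of a trained DNN rather than on random vectors.

```python
import numpy as np

def fit_gaussian(features):
    """Estimate mean and inverse covariance of in-distribution features."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of a feature vector to the fitted Gaussian."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
train_features = rng.normal(0.0, 1.0, size=(500, 4))   # assumed clean features
mean, cov_inv = fit_gaussian(train_features)

# Threshold chosen as the 99% quantile of the training distances (an assumption)
train_scores = np.array([mahalanobis(f, mean, cov_inv) for f in train_features])
threshold = np.quantile(train_scores, 0.99)

in_dist = np.zeros(4)          # close to the training data
out_dist = np.full(4, 6.0)     # far away from the training data

is_suspicious = mahalanobis(out_dist, mean, cov_inv) > threshold
```

Inputs whose features lie far from the training distribution receive a large distance and are flagged, while typical inputs stay below the threshold.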

Moreover, Feinman et al. [FCSG17] showed that adversarial attacks usually
produce high uncertainty on the output of the DNN. As a consequence, they proposed to
use the dropout technique to estimate uncertainty on the output to identify a possible
adversarial attack.

Regarding adversarial attacks, the majority of the listed attacks are
designed for image classification. Only a few adversarial attacks consider tasks that 
are closely related to autonomous driving, such as bounding box detection, seman- 
tic segmentation, instance segmentation, or even panoptic segmentation. Also, the 
majority of the adversarial attacks rely on a white-box setting, which is usually not 
present for a potential attacker. Especially universal adversarial perturbations have 
to be considered as a real threat due to their high model transferability. Generally 
speaking, the existence of adversarial examples has not been thoroughly studied yet. 
An analytical interpretation is still missing, but could help in designing more mature 
defense strategies. 
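
The dropout-based uncertainty estimate mentioned above can be sketched as follows: dropout is kept active at inference time, and the spread of the resulting predictions serves as an uncertainty signal. The tiny two-layer network with random weights, the dropout rate, and the function names are assumptions for illustration and do not reproduce the detector of [FCSG17].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mc_dropout_predict(x, w1, w2, rng, p_drop=0.5, n_samples=100):
    """Run several stochastic forward passes with dropout on the hidden layer."""
    probs = []
    for _ in range(n_samples):
        h = np.maximum(w1 @ x, 0.0)                 # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop        # random dropout mask
        h = h * mask / (1.0 - p_drop)               # inverted dropout scaling
        probs.append(softmax(w2 @ h))
    probs = np.stack(probs)
    return probs.mean(axis=0), probs.var(axis=0)

rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.3, size=(32, 8))   # assumed toy weights
w2 = rng.normal(scale=0.3, size=(3, 32))
x = rng.normal(size=8)

mean_probs, var_probs = mc_dropout_predict(x, w1, w2, rng)
```

An input could then be flagged as a possible attack when the variance (or a related measure such as the predictive entropy) exceeds a threshold calibrated on clean data.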

Regarding defense strategies, adversarial training is still considered one of the
most effective ways of increasing the robustness of a DNN. Nonetheless, while
adversarial training is indeed effective, it is rather inefficient in terms of training time. In
addition, model-agnostic defenses should be favored because, once designed, they can
be easily transferred to different models. Moreover, as most model-agnostic defense
strategies rely on gradient masking and it has been shown that gradient masking is
not a sufficient property for a defense strategy, new ways of designing model-agnostic
defenses should be taken into account. Furthermore, out-of-distribution and adversarial
attack detection or even correction methods have been a new trend for identifying
attacks. However, as the environment perception system of an autonomous driving 
vehicle could rely on various information sources, including LiDAR, optical flow, 
or depth from a stereo camera, techniques of information fusion should be further 
investigated to mitigate or even eliminate the effect of adversarial examples. 


4.2 More Realistic Attacks 


We consider the following two categories of realistic adversarial attacks: (1) image-level
attacks, which not only fool a neural network but also pose a provable threat to
autonomous vehicles, and (2) attacks which have been applied in the real world or in a
simulation environment, such as car learning to act (CARLA) [DRC+17].

Some notable examples in the first category of attacks include attacks on semantic 
segmentation [MCKBF17] or person detection [TVRG19]. 

In the second group of approaches, the attacks are specifically designed to survive 
real-world distortions, including different distances, weather and lighting conditions,
as well as camera angles. For this, adversarial perturbations are usually concentrated
in a specific image area, called an adversarial patch. Crafting an adversarial patch
involves specifying a patch region in each training image, applying transformations 
to the patch, and iteratively changing the pixel values within this region to maximize 
the network prediction error. The latter step typically relies on an algorithm, pro- 
posed for standard adversarial attacks, which aim at crafting invisible perturbations 
while misleading neural networks, e.g., C&W [CW17b], Jacobian-based saliency 
map attack (JSMA) [PMG+17], and PGD [MMS+18]. 
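
The patch-crafting loop described above (choose a patch region, apply random transformations, update the patch pixels iteratively) can be sketched with a toy linear scoring function in place of a neural network. Averaging the gradient over randomly sampled placements and brightness factors follows the spirit of optimizing over transformations. The model, the transformation set, and all names and values are assumptions for illustration, and the sketch raises the score of an attacker-chosen class rather than a general prediction error.

```python
import numpy as np

def apply_patch(image, patch, row, col, brightness):
    """Paste a brightness-scaled copy of the patch into the image at (row, col)."""
    out = image.copy()
    k = patch.shape[0]
    out[row:row + k, col:col + k] = brightness * patch
    return out

def target_score(weights, image):
    """Toy linear 'network': score of the class the attacker wants to provoke."""
    return float(np.sum(weights * image))

def craft_patch(weights, k, rng, n_steps=50, n_transforms=16, step=0.1):
    """Sign-gradient ascent on the patch, averaged over random transformations."""
    h, w = weights.shape
    patch = np.full((k, k), 0.5)
    for _ in range(n_steps):
        grad = np.zeros((k, k))
        for _ in range(n_transforms):
            row = int(rng.integers(0, h - k + 1))
            col = int(rng.integers(0, w - k + 1))
            brightness = rng.uniform(0.8, 1.0)
            # For the linear toy model: d score / d patch under this transformation
            grad += brightness * weights[row:row + k, col:col + k]
        patch = np.clip(patch + step * np.sign(grad), 0.0, 1.0)
    return patch

def mean_score(weights, image, patch, brightness=0.9):
    """Average target score over all patch positions at a fixed brightness."""
    h, w = image.shape
    k = patch.shape[0]
    scores = [target_score(weights, apply_patch(image, patch, r, c, brightness))
              for r in range(h - k + 1) for c in range(w - k + 1)]
    return float(np.mean(scores))

rng = np.random.default_rng(0)
weights = rng.normal(1.0, 0.5, size=(16, 16))   # assumed toy model weights
image = np.full((16, 16), 0.5)
gray_patch = np.full((4, 4), 0.5)
adv_patch = craft_patch(weights, k=4, rng=rng)
```

With a real DNN, the inner gradient would come from backpropagation through the transformed, patched image, and the transformation set would typically include scaling, rotation, and lighting changes.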

The first printable adversarial patch for image classification was described by 
Brown et al. [BMR+17]. Expectation over transformations (EOT) [AEIK18] is one of
the influential updates to the original algorithm: it makes patch-based attacks
robust to distortions and affine transformations. Localized and visible adversarial
noise (LaVAN) [KZG18] is a further method to generate much smaller patches (up to 
2% of the pixels in the image). In general, fooling image classification with a patch 
is a comparatively simple task, because adversarial noise can mimic an instance of 
another class and thus lower the prediction probability for a true class. 

Recently, patch-based attacks for the more difficult task of object detection have been
described [LYL+19, TVRG19]. Also, Lee and Kolter [LK19] generate a patch using 
PGD [MMS+18], followed by EOT applied to the patch. With this approach, all 
detections in an image can be successfully suppressed, even without any overlap 
of a patch with bounding boxes. Furthermore, several approaches for generating an 
adversarial T-shirt have been proposed, including [XZL+20, WLDG20]. 

DeepBillboard [ZLZ+20] is the first attempt to attack end-to-end driving mod- 
els with adversarial patches. The authors propose to generate a single patch for a 
sequence of input images to mislead four steering models, including DAVE-2 in a 
drive-by scenario. 

Apart from physical feasibility, inconspicuousness is crucial for a realistic attack. 
Whereas adversarial patches usually look like regions of noise, several works have 
explored attacks with an inconspicuous patch. In particular, Eykholt et al. [EEF+18] 
demonstrate the vulnerability of road sign classification to the adversarial pertur- 
bations in the form of only black and white stickers. In [BHG+19], an end-to-end 
driving model is attacked in CARLA by painting black lines on the road. Also,
Kong et al. [KGLL20] use a generative adversarial network to get a realistic
billboard to attack an end-to-end driving model in a drive-by scenario. In [DMW+20], a
method to hide visible adversarial perturbations with customized styles is proposed, 
which leads to adversarial traffic signs that look unsuspicious to a human.

Current research mostly focuses on attacking image-based perception of an autonomous
vehicle. Adversarial vulnerability of further components of an autonomous vehicle,
e.g., LiDAR-based perception, optical flow, and depth estimation, has only recently
gained attention. Furthermore, most attacks consider only a single component of an
autonomous driving pipeline; the question of whether the existing attacks are able to
propagate to further pipeline stages has hardly been studied yet. The first work in this
direction [JLS+20] describes an attack on object detection and tracking. The
evaluation is, however, limited to a few clips, and no experiments in the real world
have been performed. Overall, the research on realistic adversarial attacks, especially
combined with physical tests, is still in its early phase.


5 Interpretability 


Neural networks are, by their nature, black boxes and therefore intrinsically hard to 
interpret [Tay06]. Due to their unrivaled performance, they nevertheless remain the first choice
for advanced systems even in many safety-critical areas, such as level 4 automated
driving. This is why the research community has invested considerable effort to
overcome the black-box character and make deep neural networks more transparent.

We can observe three strategies that provide different viewpoints toward this goal 
in the state of the art. First is the most direct approach of opening up the black box 
and looking at intermediate representations. Being able to interpret individual layers 
of the system facilitates interpretation of the whole. The second approach tries to 
provide interpretability by explaining the network’s decisions with pixel attributions. 
Aggregated explanations of decisions can then lead to interpretability of the system 
itself. Third is the idea of approximating the network with interpretable proxies to 
benefit from the deep neural network's performance while allowing interpretation via
surrogate models. Underlying all aspects here is the area of visual analytics. 

There exists earlier research in the medical domain to help human experts
understand machine learning decisions and to convince them of these decisions [CLG+15]. Legal
requirements in the finance industry gave rise to interpretable systems that can justify their
decisions. An additional driver for interpretability research was the concern for Clever 
Hans predictors [LWB+19]. 


5.1 Visual Analytics 


Traditional data science has developed a huge tool set of automated analysis processes 
conducted by computers, which are applied to problems that are well defined in the 
sense that the dimensionality of input and output as well as the size of the dataset they 
rely on is manageable. For those problems that in comparison are more complex, 
the automation of the analysis process is limited and/or might not lead to the desired 
outcome. This is especially the case with unstructured data like image or video data in 
which the underlying information cannot directly be expressed by numbers. Rather, 
it needs to be transformed to some structured form to enable computers to perform 
some task of analysis. Additionally, with an ever-increasing amount of various types 
of data being collected, this “information overload” cannot solely be analyzed by 
automatic methods [KAF+08, KMT09]. 

Visual analytics addresses this challenge as “the science of analytical reasoning 
facilitated by interactive visual interfaces” [TC05]. Visual analytics therefore does
not only focus on either computationally processing data or visualizing results but
on coupling both tightly with interactive techniques. Thus, it enables an integration of the
human expert into the iterative visual analytics process: through visual understand- 
ing and human reasoning, the knowledge of the human expert can be incorporated 
to effectively refine the analysis. This is of particular importance, where a stringent 
safety argumentation for complex models is required. With the help of visual analyt- 
ics, the line of argumentation can be built upon arguments that are understandable 
for humans. To include the human analyst efficiently into this process, a possible 
guideline is the visual analytics mantra by Keim: “Analyze first, show the important, 
zoom, filter and analyze further, details on demand” [KAF+08].¹

The core concepts of visual analytics therefore rely on well-designed interactive 
visualizations, which support the analyst in the tasks of, e.g., reviewing, understand- 
ing, comparing, and inferring not only the initial phenomenon or data but also the 
computational model and its results themselves with the goal of enhancing the analytical
process. Driven by various fields of application, visual analytics is a multidisciplinary 
field with a wide variety of task-oriented development and research. Recent work 
has been done in several areas: depending on the task, there exist different pipeline 
approaches to create whole visual analytics systems [WZM+16]; the injection of
human expert knowledge into the process of determining trends and patterns from
data is the focus of predictive visual analytics [LCM+17, LGH+17]; enabling the
human to explore high-dimensional data [LMW+17] interactively and visually (e.g.,
via dimensionality reduction [SZS+17]) is a major technique to enhance the under- 
standability of complex models (e.g., neural networks); the iterative improvement 
and the understanding of machine learning models is addressed by using interac- 
tive visualizations in the field of general machine learning [LWLZ17], or the other 
way round: using machine learning to improve visualizations and guidance based 
on user interactions [ERT+17]. Even more focused on the loop of simultaneously 
developing and refining machine learning models is the area of interactive machine 
learning, where the topics of interface design [DK18] and the importance of users 
[ACKK14, SSZ+17] are discussed. One of the current research directions is using 
visual analytics in the area of deep learning [GTC+18, HKPC18, CL18]. However, 
due to the interdisciplinarity of visual analytics, there are still open directions and 
ongoing research opportunities. 

Especially in the domain of neural networks and deep learning, visual analytics 
is a relatively new approach in tackling the challenge of explainability and
interpretability of these models, which are often called black boxes. To enable the human to better interact
with the models, research is done in enhancing the understandability of complex
deep learning models and their outputs with the use of proper visualizations. Other
research directions aim to improve the trustworthiness of the models,
giving the opportunity to inspect, diagnose, and refine the model. Further, possible areas
for research are online training processes and the development of interactive systems 
covering the whole process of training, enhancing, and monitoring machine learning 
models. Here, the approach of mixed guidance, where system-initiated guidance is
combined with user-initiated guidance, is discussed among the visual analytics
community as well. Another challenge and open question is creating ways of comparing
models to examine which model yields a better performance, given specific situations,
and selecting or combining the best models with the goal of increasing performance
and overall safety.

¹ Extending the original visualization mantra by Shneiderman “Overview first, filter and zoom,
details on demand” [Shn96].


5.2 Intermediate Representations 


In general, representation learning [BCV13] aims to extract lower dimensional fea- 
tures in latent space from higher dimensional inputs. These features are then used as 
an effective representation for regression, classification, object detection, and other 
machine learning tasks. Preferably, latent features should be disentangled, meaning 
that they represent separate factors found in the data that are statistically indepen- 
dent. Due to their importance in machine learning, finding meaningful intermediate 
representations has long been a primary research goal. Disentangled representations 
can be interpreted more easily by humans and can, for example, be used to explain 
the reasoning of neural networks [HDR18]. 

Among the longer known methods for extracting disentangled representations are 
principal component analysis (PCA) [FP78, JC16], independent component analysis
[HO00], and nonnegative matrix factorization [BBL+07]. PCA is highly sensitive
to outliers and noise in the data. Therefore, more robust algorithms were proposed.
In [SBS12], a small neural network was already used as an encoder, and the
algorithm proposed in [FXY12] can deal with high-dimensional data. Some robust PCA
algorithms are provided with analytical performance guarantees [XCS10, RA17, 
RL19]. 

A popular method for representation learning with deep networks is the variational 
autoencoder (VAE) [KW14]. An important generalization of the method is the
β-VAE variant [HMP+17], which improved the disentanglement capability [FAA18].
Later analysis added to the theoretical understanding of β-VAE [BHP+18, SZYP19,
KP20]. Compared to standard autoencoders, VAEs map inputs to a distribution,
instead of mapping them to a fixed vector. This allows for additional regularization
of the training to avoid overfitting and ensure good representations. In β-VAEs, the
trade-off between reconstruction quality and disentanglement can be fine-tuned by
the hyperparameter β.
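
For reference, the trade-off controlled by this hyperparameter is commonly written as the following training objective (to be maximized). The notation is an assumption introduced here and does not appear in the text above: q_φ(z | x) denotes the encoder distribution, p_θ(x | z) the decoder likelihood, and p(z) the prior over the latent variables.

```latex
\mathcal{L}_{\beta}(\theta, \phi; x)
  = \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right]
  - \beta \, D_{\mathrm{KL}}\!\left(q_{\phi}(z \mid x) \,\|\, p(z)\right)
```

The first term rewards reconstruction quality, the second term pulls the encoder distribution toward the prior. Setting the weight to 1 recovers the standard VAE objective, while larger values put more emphasis on the regularization term at the expense of reconstruction.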

Different regularization schemes have been suggested to improve the VAE 
method. Among them are Wasserstein autoencoders [TBGS19, XW19], attribute 
regularization [PL20], and relational regularization [XLH+20]. Recently, a connec- 
tion between VAEs and nonlinear independent component analysis was established 
[KKMH20] and then expanded [SRK20]. 

Besides VAEs, deep generative adversarial networks can be used to construct 
latent features [SLY15, CDH+16, MSJ+16]. Other works suggest centroid encoders
[GK20] or conditional learning of Gaussian distributions [SYZ+21] as alternatives
to VAEs. In [KWG+18], concept activation vectors are defined as being orthogonal
to the decision boundary of a classifier. Apart from deep learning, entirely new
architectures, such as capsule networks [SFH17], might be used to disassemble inputs.

While many different approaches for disentangling exist, the feasibility of the 
task is not clear yet and a better theoretical understanding is needed. The
disentangling performance is hard to quantify; this is only feasible with information about
the latent ground truth [EW18]. Models that overly rely on single directions, single
neurons in fully connected networks, or single feature maps in CNNs have the ten- 
dency to overfit [MBRB18]. According to [LBL+19], unsupervised learning does not 
produce good disentangling and even small latent spaces do not reduce the sample 
complexity for simple tasks. This is in direct contrast to newer findings that show a 
decreased sample complexity for more complex visual downstream tasks [vLSB20]. 
So far, it is unclear if disentangling improves the performance of machine learning 
tasks. 

In order to be interpretable, latent disentangled representations need to be aligned 
with human understandable concepts. In [EIS+19], training with adversarial exam- 
ples was used and the learned representations were shown to be more aligned with 
human perception. For explainable AI, disentangling alone might not be enough to 
generate interpretable output, and additional regularization could be needed. 

In Chapter “Invertible Neural Networks for Understanding Semantics of Invari- 
ances of CNN Representations” [REBO22], a new approach is presented, which aims 
at extracting invariant representations that are present in a trained network. To this 
end, invertible neural networks are deployed to recover these invariances and to map 
them to accessible semantic concepts. These concepts allow for both interpretation of 
an inner model representation and means to carefully manipulate them and examine 
the effect on the prediction. 


5.3 Pixel Attribution 


The non-linearity and complexity of DNNs allow them to solve perception problems, 
such as detecting a pedestrian, that cannot be specified in detail. At the same time, 
the automatic extraction of features given in an input image and the mapping to the 
respective prediction is counterintuitive and incomprehensible for humans, which 
makes it hard to argue safety for a neural-network-based perception task. Feature 
importance techniques are currently predominantly used to diagnose the causes of 
incorrect model behaviors [BXS+20]. So-called attribution maps are a visual tech- 
nique to express the relationship between relevant pixels in the input image and the 
network’s prediction. Regions in an image that contain relevant features are high- 
lighted accordingly. Attribution approaches mostly map to one of three categories. 

Gradient-based and activation-based approaches are proposed in [SVZ14, SDBR14,
BBM+15, MLB+17, SCD+20, STK+17, SGK19], among others. Gradient-based approaches rely on the gradient
of the prediction with respect to the input. Regions that were most relevant for the
prediction are highlighted. Activation-based approaches relate the feature maps of
the last convolutional layer to output classes.

Perturbation-based approaches [ZF14, FV17, ZCAW17, HMK+19] suggest
manipulating the input. If the prediction changes significantly, the manipulated part of the input
may hold at least a possible explanation.
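
A minimal sketch of a perturbation-based attribution map is occlusion: image regions are replaced by a baseline value one after another, and the drop in the model output is recorded for each region. The toy model `toy_predict`, the window size, and the baseline value are assumptions for illustration only.

```python
import numpy as np

def occlusion_map(predict, image, window, baseline=0.0):
    """Attribute to each window the score drop caused by occluding it."""
    h, w = image.shape
    reference = predict(image)
    attribution = np.zeros_like(image, dtype=float)
    for r in range(0, h, window):
        for c in range(0, w, window):
            occluded = image.copy()
            occluded[r:r + window, c:c + window] = baseline
            attribution[r:r + window, c:c + window] = reference - predict(occluded)
    return attribution

def toy_predict(image):
    """Assumed toy model: the score is the mean brightness of the image center."""
    return float(image[2:6, 2:6].mean())

image = np.ones((8, 8))
attr = occlusion_map(toy_predict, image, window=2)
```

Only windows inside the central region, which the toy model actually uses, receive a nonzero attribution. Note that one model evaluation per window is needed, which is the source of the high computation time of such methods.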

While gradient-based approaches are oftentimes faster in computation, 
perturbation-based approaches are much easier to interpret. As many studies have 
shown [STY17, AGM+20], there is still a lot of research to be done before attribution 
methods are able to robustly provide explanations for model predictions, in partic- 
ular, erroneous behavior. One key difficulty is the lack of an agreed-upon definition 
of a good attribution map including important properties. Even between humans, 
it is hard to agree on what a good explanation is due to its subjective nature. This 
lack of ground truth makes it hard or even impossible to quantitatively evaluate an 
explanation method. Instead, this evaluation is done only implicitly. One typical way 
to do this is the axiomatic approach. Here, a set of desiderata for an attribution method
is defined, on which different attribution methods are then evaluated. Alternatively,
different attribution methods may be compared by perturbing the input features start- 
ing with the ones deemed most important and measuring the drop in accuracy of the 
perturbed models. The best method will result in the greatest overall loss in
accuracy as the number of omitted inputs increases [ACOG17]. Moreover, for gradient-based
methods it is hard to assess if an unexpected attribution is caused by a poorly per- 
forming network or a poorly performing attribution method [FV17]. How to cope 
with negative evidence, i.e., the object was predicted because a contrary clue in the 
input image was missing, is an open research question. Additionally, most methods 
were shown on classification tasks until now. It remains to be seen how they can 
be transferred to object detection and semantic segmentation tasks. In the case of 
perturbation-based methods, the high computation time and single-image analysis 
inhibit widespread application. 
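
The perturbation-based comparison of attribution methods described above can be sketched as a deletion curve: features are removed in the order given by an attribution method, and the model output is recorded after each removal. The sketch uses a toy linear model and tracks the model score instead of accuracy; the model, the data, and the function names are assumptions for illustration.

```python
import numpy as np

def deletion_curve(predict, x, order, baseline=0.0):
    """Model score after successively replacing features by a baseline value."""
    x_cur = x.copy()
    scores = [predict(x_cur)]
    for idx in order:
        x_cur[idx] = baseline
        scores.append(predict(x_cur))
    return np.array(scores)

rng = np.random.default_rng(0)
w = rng.normal(size=20)                 # assumed toy linear model
x = rng.uniform(0.0, 1.0, size=20)

def predict(v):
    return float(w @ v)

# For a linear model, weight times input is an exact per-feature contribution
attribution = w * x
order_attr = np.argsort(-attribution)   # most important feature first
order_rand = rng.permutation(20)        # uninformed reference ordering

curve_attr = deletion_curve(predict, x, order_attr)
curve_rand = deletion_curve(predict, x, order_rand)
```

An informative attribution method yields a curve that drops faster than the reference ordering, i.e., it has a smaller area under the deletion curve.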


5.4 Interpretable Proxies 


Neural networks are capable of capturing complicated logical (cor)relations. How- 
ever, this knowledge is encoded on a sub-symbolic level in the form of learned weights 
and biases, meaning that the reasoning behind the processing chain cannot be directly 
read out or interpreted by humans [CCO98]. To explain the sub-symbolic
processing, one can either use attribution methods (cf. Sect. 5.3), or lift this sub-symbolic
representation to a symbolic one [GBY+18], meaning a more interpretable one.
Interpretable proxies or surrogate models try to achieve the latter: the DNN behavior is
approximated by a model that uses symbolic knowledge representations. Symbolic 
representations can be linear models like local interpretable model-agnostic explana- 
tions (LIME) [RSG16] (proportionality), decision trees (if-then chains) [GBY+18], 
or loose sets of logical rules. Logical connectors can simply be AND and OR but 
also more general ones like at-least-M-of-N [CCO98]. The expressiveness of an 
approach refers to the logic that is used: Boolean-only versus first-order logic, and 
binary versus fuzzy logic truth values [TAGD98]. Unlike attribution methods (cf.
Sect. 5.3), these representations can capture combinations of features and (spatial)
relations of objects and attributes. As an example, consider “eyes are closed” as
explanation for “person asleep”: attribution methods could only mark the location of
the eyes, dismissing the relations of the attributes [RSS18]. All mentioned surrogate
model types (linear, set of rules) require interpretable input features in order to be 
interpretable themselves. These features must either be directly obtained from the 
DNN input or (intermediate) output, or automatically be extracted from the DNN 
representation. Examples for extraction are the super-pixeling used in LIME for input 
feature detection, or concept activation vectors [KWG+18] for DNN representation 
decoding. 
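
A local linear surrogate in the spirit of LIME can be sketched as follows: interpretable binary features (e.g., super-pixels being present or removed) are sampled, the black-box model is queried, and a proximity-weighted linear model is fitted. This is not the reference implementation of [RSG16]; the kernel, the sampling scheme, the toy black box, and all names are assumptions for illustration.

```python
import numpy as np

def fit_local_surrogate(black_box, n_features, rng, n_samples=200, kernel_width=0.75):
    """Fit a proximity-weighted linear model around the unperturbed sample."""
    # z = 1 keeps an interpretable feature (e.g., a super-pixel), z = 0 removes it
    z = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)
    y = np.array([black_box(row) for row in z])
    # Proximity to the unperturbed sample, in which all features are present
    distance = np.sqrt(((1.0 - z) ** 2).sum(axis=1)) / np.sqrt(n_features)
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Weighted least squares with an intercept column
    design = np.hstack([z, np.ones((n_samples, 1))])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(design * sw, y * sw[:, 0], rcond=None)
    return coef[:-1], coef[-1]

def black_box(z):
    """Assumed toy black box that happens to be linear in the features."""
    return 2.0 * z[0] - 1.0 * z[1] + 0.5

rng = np.random.default_rng(0)
coefficients, intercept = fit_local_surrogate(black_box, n_features=4, rng=rng)
```

For this linear toy black box the surrogate recovers the feature weights exactly; for a DNN the coefficients only describe the behavior in the neighborhood of the explained sample.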

Quality criteria and goals for interpretable proxies are [TAGD98]: accuracy of the 
standalone surrogate model on unseen examples, fidelity of the approximation by the 
proxy, consistency with respect to different training sessions, and comprehensibility 
measured by the complexity of the rule set (number of rules, number of hierarchical 
dependencies). The criteria are usually in conflict and need to be balanced: better
accuracy may require a more complex, and thus less comprehensible, set of rules.
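
The difference between the first two criteria can be made concrete with a small sketch: accuracy compares the proxy with the ground truth, whereas fidelity compares the proxy with the DNN it approximates. The label vectors below are made-up toy values, and the function names are assumptions for illustration.

```python
import numpy as np

def accuracy(proxy_pred, truth):
    """Share of samples on which the proxy agrees with the ground truth."""
    return float(np.mean(proxy_pred == truth))

def fidelity(proxy_pred, model_pred):
    """Share of samples on which the proxy agrees with the approximated DNN."""
    return float(np.mean(proxy_pred == model_pred))

truth = np.array([0, 1, 1, 0, 1])        # toy ground-truth labels
dnn_pred = np.array([0, 1, 0, 0, 1])     # toy DNN predictions
proxy_pred = np.array([0, 1, 0, 1, 1])   # toy surrogate predictions

proxy_accuracy = accuracy(proxy_pred, truth)      # 3 of 5 labels match
proxy_fidelity = fidelity(proxy_pred, dnn_pred)   # 4 of 5 predictions match
```

A proxy can thus reproduce the DNN well, including its mistakes, while being less accurate on the actual task, which is why both criteria are reported separately.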

Approaches for interpretable proxies differ in the validity range of the
representations: some aim for surrogates that are only valid locally around specific samples, like
in LIME [RSG16] or in [RSS18] via inductive logic programming. Other approaches
try to approximate aspects of the model behavior more globally. Another
categorization is defined by whether full access (white box), some access (gray box), or
no access (black box) to the DNN internals is needed. One can further differentiate 
between post hoc approaches that are applied to a trained model, and approaches 
that try to integrate or enforce symbolic representations during training. Post hoc 
methods cover the wide field of rule extraction techniques for DNNs. The
interested reader may refer to [AK12, Hai16]. Most white- and gray-box methods try
to turn the DNN connections into if-then rules that are then simplified, as done
in DeepRED [ZLMJ16]. A black-box example is validity interval analysis [Thr95],
which refines or generalizes rules on input intervals, either starting from one sample 
or a general set of rules. Enforcement of symbolic representations can be achieved 
by enforcing an output structure that provides insights to the decision logic, such as 
textual explanations or a rich output structure allowing investigation of correlations 
[XLZ+18]. An older discipline for enforcing symbolic representations is the field of 
neural-symbolic learning [SSZ19]. The idea is based on a hybrid learning cycle in 
which a symbolic learner and a DNN iteratively update each other via rule insertion 
and extraction.

The comprehensibility of global surrogate models suffers from the
complexity and size of current DNNs. Thus, stronger rule simplification
methods are required [Hai16]. The alternative direction of local approximations mostly
concentrates on linear models instead of more expressive rules [Thr95, RSS18].
Furthermore, balancing of the quality objectives is hard since available indicators for
interpretability may not be ideal. Lastly, applicability is heavily impaired by
the requirement of interpretable input features. These are usually not readily
available from input (often pixel level) or DNN output. Supervised extraction approaches
vary in their fidelity, and unsupervised ones, such as the super-pixel clusters of LIME,
do not guarantee to yield meaningful or interpretable results.


6 Uncertainty


Uncertainty refers to the view that a neural network is not conceived as a deterministic 
function but as a probabilistic function or estimator, delivering a probability distribution
for each input point. Ideally, the mean value of the distribution should be as close as
possible to the ground-truth value of the function being approximated by the neural
network. The uncertainty of the neural network refers to its variance when being
considered as a random variable, which allows a confidence to be derived with respect
to the mean value. Regarding safety, the variance may lead to estimations about
the confidence associated with a specific network output and opens the option for 
discarding network outputs with insufficient confidence. 

There are roughly two broad approaches for training neural networks as proba- 
bilistic functions: parametric approaches [KG17] and Bayesian neural networks on 
the one hand, such as [BCKW15], where the transitions along the network edges are
modeled as probability distributions, and ensemble-based approaches on the other
hand [LPB17, SOF+19], where multiple networks are trained and considered as
samples of a common output distribution. Apart from training as a probabilistic function,
uncertainty measures have been derived from single, standard neural networks by 
post-processing on the trained network logits, leading, for example, to calibration 
measures (cf. e.g., [SOF+19]). 


6.1 Generative Models 


Generative models belong to the class of unsupervised machine learning models. 
From a theoretical perspective, these are particularly interesting, because they offer a 
way to analyze and model the density of data. Given a finite dataset D independently 
distributed according to some distribution p(x), x ∈ D, generative models aim to 
estimate or enable sampling from the underlying density p(x) in a model F(x, θ). 
The resulting model can be used for data indexing [Wes04], data retrieval [ML11], for 
visual recognition [KSH12], speech recognition and generation [HDY+12], language 
processing [KM02, CE17], and robotics [Thr02]. Following [OE18], we can group 
generative models into two main classes: 


• Cost function-based models, such as autoencoders [KW14, Doe16], deep belief 
networks [Hin09], and generative adversarial networks [GPAM+14, RMC15, 
Goo17]. 

• Energy-based models [LCH+07, SH09], where the joint probability density is 
modeled by an energy function. 


Besides these deep learning approaches, generative models have been studied in 
machine learning in general for quite some time (cf. [Fry77, Wer78, Sil86, JMS96, 
She04, Sco15, Gra18]). Very prominent examples of generative models are 
Gaussian processes [WR96, Mac98, WB98, Ras03] and their deep learning extensions 
[DL13, BHLHL+16]. 
An example of a generative model being employed for image segmentation uncertainty 
estimation is the probabilistic U-Net [KRPM+18]. Here, a variational autoencoder 
(VAE) conditioned on the image is trained to model uncertainties. Samples 
from the VAE are fed into a segmentation U-Net, which can thus give different results 
for the same image. This was tested on medical images, where inter-rater 
disagreements lead to uncertain segmentation results, and on Cityscapes segmentation. 
For the Cityscapes segmentation, the investigated use case was label ambiguity (e.g., 
is a BMW X7 a car or a van?) using artificially created, controlled ambiguities. Results 
showed that the probabilistic U-Net could reproduce the segmentation ambiguity 
modes more reliably than competing methods, such as a dropout U-Net, which is 
based on techniques elaborated in the next section. 


6.2 Monte Carlo Dropout 


A widely used technique to estimate model uncertainty is Monte Carlo (MC) dropout 
[GG16], which offers a Bayesian motivation, conceptual simplicity, and scalability to 
application-size networks. This combination distinguishes MC dropout from competing 
Bayesian neural network (BNN) approximations (e.g., [RBB18, BCKW15], 
see Sect. 6.3). However, these approaches and MC dropout share the same goal: to 
equip neural networks with a self-assessment mechanism that detects unknown input 
concepts and thus potential model insufficiencies. 

On a technical level, MC dropout assumes prior distributions on network acti- 
vations, usually independent and identically distributed (i.i.d.) Bernoulli distri- 
butions. Model training with iteratively drawn Bernoulli samples, the so-called 
dropout masks, then yields a data-conditioned posterior distribution within the cho- 
sen parametric family. It is interesting to note that this training scheme was used 
earlier—independent from an uncertainty context—for better model generalization 
[SHK+14]. At inference, sampling provides estimates of the input-dependent output 
distributions. The spread of these distributions is then interpreted as the prediction 
uncertainty that originates from limited knowledge of model parameters. Borrowing 
“frequentist” terms, MC dropout can be considered as an implicit network ensemble, 
i.e., as a set of networks that share (most of) their parameters. 
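
The sampling-based inference described above can be sketched in a few lines. The following is a minimal illustration, not the implementation of [GG16]: it assumes a toy two-layer network with hypothetical, untrained weights `w1` and `w2`, a single dropout layer on the hidden activations, and a fixed dropout rate. The spread of the sampled softmax outputs serves as the uncertainty estimate.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, drop_rate=0.5, n_samples=50, seed=0):
    """Mean and variance of the softmax output over sampled dropout masks.

    The network (one hidden ReLU layer) and all parameter names are
    illustrative assumptions, not taken from the text.
    """
    rng = np.random.default_rng(seed)
    keep = 1.0 - drop_rate
    probs = []
    for _ in range(n_samples):
        h = np.maximum(x @ w1, 0.0)                 # hidden ReLU activations
        mask = rng.binomial(1, keep, size=h.shape)  # i.i.d. Bernoulli dropout mask
        h = h * mask / keep                         # inverted-dropout rescaling
        logits = h @ w2
        logits = logits - logits.max(axis=-1, keepdims=True)
        p = np.exp(logits)
        probs.append(p / p.sum(axis=-1, keepdims=True))
    probs = np.stack(probs)                         # (n_samples, batch, classes)
    return probs.mean(axis=0), probs.var(axis=0)

rng = np.random.default_rng(1)
w1, w2 = rng.normal(size=(4, 16)), rng.normal(size=(16, 3))  # toy, untrained weights
mean, var = mc_dropout_predict(rng.normal(size=(2, 4)), w1, w2)
```

Each loop iteration corresponds to one member of the implicit ensemble; since the iterations are independent, they can be run in parallel.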

In practice, MC dropout requires only a minor modification of the optimization 
objective during training and multiple, trivially parallelizable forward passes during 
inference. The loss modification is largely agnostic to network architecture and does 
not cause substantial overhead. This is in contrast to the sampling-based inference, 
which increases the computational effort massively, by estimated factors of 20-100 
compared to networks without MC dropout. A common practice is therefore the use 
of last-layer dropout [SOF+19] that reduces computational overhead to estimated 
factors of 2-10. Alternatively, analytical moment propagation allows sampling-free 
MC dropout inference at the price of additional approximations (e.g., [PFC+19]). 
Further extensions of MC dropout target the integration of data-inherent (aleatoric) 
uncertainty [KG17] and tuned performance by learning layer-specific dropout rates 
using concrete relaxations [GHK17]. 

The quality of MC dropout uncertainties is typically evaluated using negative log- 
likelihood (NLL), expected calibration error (ECE) and its variants (cf. Sect. 6.6) and 
by considering correlations between uncertainty estimates and model errors (e.g., 
AUSE [ICG+18]). Moreover, it is common to study how useful uncertainty esti- 
mates are for solving auxiliary tasks like out-of-distribution classification [LPB17] 
or robustness w.r.t. adversarial attacks. 

MC dropout is a workhorse of safe ML, being used with various networks and 
for a multitude of applications (e.g., [BFS18]). However, several authors pointed out 
shortcomings and limitations of the method: MC dropout bears the risk of overconfident 
false predictions [Osb16], offers less diverse uncertainty estimates compared 
to (the equally simple and scalable) deep ensembles ([LPB17], see Sect. 7.1), and 
provides only rudimentary approximations of true posteriors. 

Relaxing these modeling assumptions and strengthening the Bayesian motivation 
of MC dropout is therefore an important research avenue. Further directions for future 
work are the development of semantic uncertainty mechanisms (e.g., [KRPM+18]), 
improved local uncertainty calibrations, and a better understanding of the outlined 
sampling-free schemes for uncertainty estimation. 


6.3 Bayesian Neural Networks 


As the name suggests, Bayesian neural networks (BNNs) are inspired by a Bayesian 
interpretation of probability (for an introduction cf. [Mac03]). In essence, it rests on 
Bayes’ theorem, 


\[ p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}, \tag{2} \]


stating that the conditional probability density function (PDF) p(a|b) for a given 
b may be expressed in terms of the inverted conditional PDF p(b|a). For machine 
learning, where one intends to make predictions y for unknown x given some training 
data D, this can be reformulated into 


\[ y = \mathrm{NN}(x \mid \theta) \quad \text{with} \quad p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}. \tag{3} \]


Therein NN denotes a conventional (deep) neural network (DNN) with model parameters 
θ, e.g., the set of weights and biases. In contrast to a regular DNN, the weights 
are given in terms of a probability distribution p(θ|D), turning also the output y of 
a BNN into a distribution. This allows one to study the mean μ = ⟨y¹⟩ of the DNN for a 
given x as well as higher moments of the distribution; typically, the resulting variance 
σ² = ⟨(y − μ)²⟩ is of interest, where 

\[ \langle y^k \rangle = \int \mathrm{NN}(x \mid \theta)^k \, p(\theta \mid D) \, \mathrm{d}\theta. \tag{4} \]


While μ yields the output of the network, the variance σ² is a measure for the 
uncertainty of the model for the prediction at the given point. Central to this approach 
is the probability of the data given the model, here denoted by p(D|θ), as it is the key 
component connecting model and training data. Typically, the evidence p(D) 
is “ignored” as it only appears as a normalization constant within the averages, see (4). 
In the cases where the data D is itself a distribution due to inherent uncertainty, i.e., 
presence of aleatoric risk [KG17], such a concept seems natural. However, Bayesian 
approaches are also applicable for all other cases. In those, loosely speaking, the 
likelihood of θ is determined via the chosen loss function (for the connection between 
the two concepts cf. [Bis06]). 
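
As a worked illustration of how the moments in (4) are typically evaluated, the integral can be approximated by a Monte Carlo average. This is a sketch under the assumption that T parameter samples θ_t can be drawn from the posterior or from an approximation to it; the symbols T and θ_t are introduced here for illustration only.

```latex
\langle y^k \rangle \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{NN}(x \mid \theta_t)^k,
\qquad \theta_t \sim p(\theta \mid D),

\mu = \langle y^1 \rangle,
\qquad
\sigma^2 = \langle (y - \mu)^2 \rangle = \langle y^2 \rangle - \mu^2 .
```

The last identity follows by expanding the square and using the linearity of the average, so the first two moments suffice to obtain the variance.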

On this general level, Bayesian approaches are broadly accepted and also find 
use for many other model classes besides neural networks. However, the loss surfaces 
of DNNs are known for their high dimensionality and strong non-convexity. 
Typically, there are abundant parameter combinations θ that lead to (almost) equally 
good approximations to the training data D with respect to a chosen loss. This makes 
an evaluation of p(θ|D) for DNNs close to impossible in full generality. At least, no 
(exact) solutions for this case exist at the moment. Finding suitable approximations 
to the posterior distribution p(θ|D) is an ongoing challenge for the construction of 
BNNs. At this point, we only summarize two major research directions in the field. 
One approach is to assume that the distribution factorizes. While the full solution 
would be a joint distribution implying correlations between different weights (etc.), 
possibly even across layers, this approximation takes each element of θ to be independent 
from the others. Although this is a strong assumption, it is often made; in 
this case, parameters for the respective distributions of each element can be learned 
via training (cf. [BCKW15]). The second class of approaches focuses on the region 
of the loss surface around the minimum chosen for the DNN. As discussed, the 
loss relates to the likelihood, and quantities such as the curvature at the minimum are 
therefore directly connected to the distribution of θ. Unfortunately, already using 
this type of quantities requires further approximations [RBB18]. Alternatively, the 
convergence of the training process may be altered to sample networks close to 
the minimum [WT11]. While this approach contains information about correlations 
among the weights, it is usually restricted to a specific minimum. For a non-Bayesian 
approach taking into account several minima, see deep ensembles in Sect. 7.1. BNNs 
also touch other concepts such as MC dropout (cf. [GG16] and Sect. 6.2), or prior 
networks, which are based on a Bayesian interpretation but use conventional DNNs 
with an additional (learned) σ output [MG18]. 
6.4 Uncertainty Metrics for DNNs in Frequentist Inference 


Classical uncertainty quantification methods in frequentist inference are mostly based 
on the outputs of statistical models. Their uncertainty is quantified and assessed, 
for instance, via dispersion measures in classification (such as entropy, probabil- 
ity margin, or variation ratio), or confidence intervals in regression. However, the 
nature of DNN architectures [RDGF15, CPK+18] and the cutting-edge applications 
tackled by those (e.g., semantic segmentation, cf. [COR+16]) open the way toward 
more elaborate uncertainty quantification methods. Besides the mentioned classical 
approaches, intermediate feature representations within a DNN (cf. [CZYS19, 
OSM19]) or gradients according to self-affirmation that represent re-learning stress 
(see [ORG18]) reveal additional information. In addition, in case of semantic segmentation, 
the geometry of a prediction may give access to further information, 
cf. [MRG20, RS19, RCH+20]. By the computation of statistics of those quanti- 
ties as well as low-dimensional representations thereof, we obtain more elaborate 
uncertainty quantification methods specifically designed for DNNs that can help us 
to detect misclassifications and out-of-distribution objects (cf. [HG17]). Features 
extracted during a forward pass of a data point x through a DNN F can be considered 
layer-wise, i.e., F_ℓ(x) after the ℓ-th layer. These can be translated into a handful 
of quantities per layer [OSM19] or further processed by another DNN that aims at 
detecting errors [CZYS19]. While, in particular, [OSM19] presents a proof of con- 
cept on small-scale classification problems, their applicability to large-scale datasets 
and problems, such as semantic segmentation and object detection, remains open. 
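
The classical dispersion measures mentioned at the beginning of this section can be computed directly from a softmax output. The sketch below is illustrative only; definitions vary across the literature, and it assumes the probability margin to be the difference between the two largest class probabilities and the variation ratio to be one minus the largest class probability.

```python
import numpy as np

def dispersion_measures(probs, eps=1e-12):
    """Entropy, probability margin, and variation ratio per softmax vector.

    probs: array of shape (..., n_classes) whose last axis sums to one.
    The exact definitions used here are assumptions (see lead-in).
    """
    probs = np.asarray(probs, dtype=float)
    entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    top2 = np.sort(probs, axis=-1)[..., -2:]   # second-largest and largest value
    margin = top2[..., 1] - top2[..., 0]       # small margin = ambiguous prediction
    variation_ratio = 1.0 - top2[..., 1]       # large ratio = low confidence
    return entropy, margin, variation_ratio

probs = np.array([[0.98, 0.01, 0.01],    # confident prediction
                  [0.40, 0.35, 0.25]])   # ambiguous prediction
entropy, margin, variation_ratio = dispersion_measures(probs)
```

For semantic segmentation, the same function can be applied pixel-wise to an array of shape (height, width, n_classes), yielding one heatmap per measure.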

The development of gradient-based uncertainty quantification methods [ORG18] 
is guided by one central question: if the present prediction was true, how much 
re-learning would this require? The corresponding hypothesis is that wrong predictions 
would be more in conflict with the knowledge encoded in the deep neural network 
than correct ones, therefore causing increased re-learning stress. Given a predicted 
class 


\[ \hat{y} = \operatorname*{arg\,max}_{s \in S} F(x) \tag{5} \]


we compute the gradient of layer ℓ corresponding to the predicted label. That is, 
given a loss function J, we compute 


\[ \nabla_{\ell} J(\hat{y}, x, \theta) \tag{6} \]


via backpropagation. The obtained quantities can be treated similarly to the case 
of forward pass features. While this concept seems to be prohibitively expensive 
for semantic segmentation (at least when calculating gradients for each pixel of the 
multidimensional output y), its applicability to object detection might be feasible, in 
particular, with respect to offline applications. Gradients are also of special interest 
in active learning with query by expected model change (cf. [Set10]). 
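
To make the idea of re-learning stress concrete, the following sketch computes the gradient in (6) for the simplest possible case. It is an illustration under stated assumptions, not the method of [ORG18]: the "network" is reduced to a single softmax layer with hypothetical weights `w` and bias `b` acting on a feature vector `h`, and J is taken to be the cross-entropy loss with the predicted class from (5) used as the target.

```python
import numpy as np

def last_layer_gradient_norm(h, w, b):
    """Norm of dJ/dW for a softmax layer, using the predicted label as target.

    h: (d,) feature vector, w: (d, c) weights, b: (c,) bias.
    All names are illustrative assumptions.
    """
    logits = h @ w + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    y_hat = int(np.argmax(p))            # predicted class, cf. (5)
    target = np.zeros_like(p)
    target[y_hat] = 1.0
    # For cross-entropy on softmax outputs, dJ/dlogits = p - target,
    # hence dJ/dW = outer(h, p - target).
    grad_w = np.outer(h, p - target)
    return y_hat, np.linalg.norm(grad_w)

h, b = np.array([1.0, 0.0]), np.zeros(2)
_, g_confident = last_layer_gradient_norm(h, np.array([[5.0, 0.0], [0.0, 0.0]]), b)
_, g_ambiguous = last_layer_gradient_norm(h, np.array([[0.1, 0.0], [0.0, 0.0]]), b)
```

In this toy setting, the ambiguous prediction yields the larger gradient norm, in line with the hypothesis that uncertain predictions cause more re-learning stress.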

In the context of semantic segmentation, geometrical information on segments’ 
shapes as well as neighborhood relations of predicted segments can be taken into 
account alongside dispersion measures. It has been demonstrated [RS19, 
MRG20, RCH+20] that the detection of errors in an in-distribution setting strongly 
benefits from geometrical information. Recently, this has also been considered in 
scenarios under moderate domain shift [ORF20]. However, its applicability to out- 
of-distribution examples and to other sensors than the camera remains subject to 
further research. 

The problem of realistically quantifying uncertainty measures will be taken up 
in Chapter “Uncertainty Quantification for Object Detection: Output- and Gradi- 
ent-based Approaches” [RSKR22]. Here output-based and learning-gradient-based 
uncertainty metrics for object detection will be examined, showing that they are 
non-correlated. Thus, a combination of both paradigms leads, as is demonstrated, 
to a better object detection uncertainty estimate and by extension to a better overall 
detection accuracy. 


6.5 Markov Random Fields 


Although deep neural networks are currently the state of the art for almost all computer 
vision tasks, Markov random fields (MRFs) remain one of the fundamental 
techniques used for many computer vision tasks, specifically image segmentation 
[LWZ09, KK11]. MRFs derive their power from their ability to model dependencies 
between pixels in an image. By means of energy functions, MRFs integrate 
unary and pairwise pixel terms into a single model [WKP13]. 
Given the model, MRFs are used to infer the optimal configuration yielding the 
lowest energy, mainly using maximum a posteriori (MAP) techniques. Several MAP 
inference approaches are used to yield the optimal configuration, such as graph cuts 
[KRBT08] and belief propagation algorithms [FZ10]. However, as with neural networks, 
MAP inference techniques result in deterministic point estimates of the optimal 
configuration without any sense of uncertainty in the output. To obtain uncertainties 
on results from MRFs, most of the work is directed toward modeling MRFs with 
Gaussian distributions. Getting uncertainties from MRFs with Gaussian distributions 
is possible by two typical methods: either the posterior is approximated by a simpler 
model, from which sampling is easy or for which the variances can be estimated 
analytically, or approximate sampling from the posterior is used. Approximate models 
include those inferred using variational Bayesian (VB) methods, like mean-field approximations, 
and using Gaussian process (GP) models enforcing a simplified prior model [Bis06, 
LUAD16]. Examples of approximate sampling methods include traditional Markov 
chain Monte Carlo (MCMC) methods like Gibbs sampling [GG84]. Some recent 
theoretical advances propose the perturb-and-MAP framework and a Gumbel perturbation 
model (GPM) [PY11, HMJ13] to exactly sample from MRF distributions. 
Another line of work has also been proposed, where MAP inference techniques 
are used to estimate the probability of the network output. With the use of graph 
cuts, [KT08] try to estimate uncertainty using the min-marginals associated with 
the label assignments of a random field. The work by Kohli and Torr [KT08] 
was later extended to techniques other than 
graph cuts [TA12] and to computing uncertainties on multi-label marginal distributions 
[STP17]. A current research direction is the incorporation of MRFs into deep neural 
networks, along with providing uncertainties on the output [SU15, CPK+18]. This 
can also be extended to other forms of neural networks, such as recurrent neural networks, 
to provide uncertainties on the segmentation of video streams by extending 
the dependencies of pixels to previous frames [ZJRP+15, LLL+17]. 
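
The sampling route to MRF uncertainties mentioned above (Gibbs sampling) can be illustrated on a small grid. The sketch below makes several assumptions that are not taken from the text: a discrete label MRF with a Potts pairwise term of strength `beta`, a 4-neighborhood, and hypothetical unary energies. Per-pixel marginals are estimated from the label frequencies after a burn-in phase.

```python
import numpy as np

def gibbs_marginals(unary, beta=1.0, n_sweeps=200, burn_in=50, seed=0):
    """Estimate per-pixel label marginals of a Potts MRF by Gibbs sampling.

    unary: (H, W, K) unary energies; beta: penalty for disagreeing neighbors.
    Model and parameter names are illustrative assumptions.
    """
    unary = np.asarray(unary, dtype=float)
    rng = np.random.default_rng(seed)
    H, W, K = unary.shape
    labels = unary.argmin(axis=-1)              # start from the unary optimum
    counts = np.zeros((H, W, K))
    rows, cols = np.arange(H)[:, None], np.arange(W)[None, :]
    for sweep in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                energy = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        # Potts term: cost beta for each label unequal to the neighbor's
                        energy += beta * (np.arange(K) != labels[ni, nj])
                p = np.exp(-(energy - energy.min()))
                labels[i, j] = rng.choice(K, p=p / p.sum())
        if sweep >= burn_in:
            counts[rows, cols, labels] += 1     # one count per pixel and sweep
    return counts / counts.sum(axis=-1, keepdims=True)

unary = np.zeros((4, 4, 2))
unary[..., 1] = 3.0                 # label 0 is preferred everywhere ...
unary[2, 2] = [0.0, 0.0]            # ... except for one pixel with an ambiguous unary term
marginals = gibbs_marginals(unary)
pixel_entropy = -(marginals * np.log(marginals + 1e-12)).sum(axis=-1)
```

The entropy of the estimated marginals gives a per-pixel uncertainty map, in contrast to the single point estimate returned by MAP inference.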


6.6 Confidence Calibration 


Neural network classifiers output a label Ŷ ∈ 𝒴 on a given input x ∈ D with an 
associated confidence P̂. This confidence can be interpreted as the probability 
that the predicted label matches the ground-truth label Y ∈ 𝒴. Therefore, 
these probabilities should reflect the “self-confidence” of the system. If the empirical 
accuracy for any confidence level matches the predicted confidence, a model is called 
well calibrated. Therefore, a classification model is perfectly calibrated if 


\[ \underbrace{P(\hat{Y} = Y \mid \hat{P} = p)}_{\text{accuracy given } p} = \underbrace{p}_{\text{confidence}} \quad \forall p \in [0, 1] \tag{7} \]


is fulfilled [GPSW17]. For example, assume 100 predictions with confidence values 
of 0.9. We call the model well calibrated if 90 out of these 100 predictions are actually 
correct. However, recent work has shown that modern neural networks tend to be 
overconfident in their predictions [GPSW17]. The deviation of a model from perfect 
calibration can be measured by the expected calibration error (ECE) [NCH15]. It is 
possible to recalibrate models as a post-processing step after classification. One way 
to get a calibration mapping is to group all predictions into several bins by their 
confidence. Using such a binning scheme, it is possible to compute the empirical 
accuracy for certain confidence levels, as has long been known from 
reconstructing confidence outputs for Viterbi decoding [HR90]. Common methods 
are histogram binning [ZE01], isotonic regression [ZE02], or more advanced methods 
like Bayesian binning into quantiles (BBQ) [NCH15] and ensembles of near-isotonic 
regression (ENIR) [NC16]. Another way to get a calibration mapping is to use scaling 
methods based on logistic regression like Platt scaling [Pla99], temperature scaling 
[GPSW17], and beta calibration [KSFF17]. 
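
The binning scheme underlying the ECE can be sketched as follows. This is a minimal illustration assuming equal-width confidence bins and a weighting of each bin by its share of samples; the function name and the number of bins are choices made here, not prescribed by the text. The example reuses the 100 predictions with confidence 0.9 from above.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b collects confidences in (edges[b], edges[b + 1]]; 0.0 goes to the first bin.
    bin_ids = np.clip(np.digitize(confidences, edges, right=True) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap          # weight by fraction of samples in the bin
    return ece

conf = np.full(100, 0.9)
# 90 of 100 predictions correct: accuracy matches confidence, ECE is (numerically) zero.
ece_calibrated = expected_calibration_error(conf, np.r_[np.ones(90), np.zeros(10)])
# Only 60 of 100 correct: the model is overconfident by 0.3.
ece_overconfident = expected_calibration_error(conf, np.r_[np.ones(60), np.zeros(40)])
```

The per-bin accuracies computed inside the loop are also the quantities a histogram-binning recalibration would use as its calibration mapping.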

In the setting of probabilistic regression, a model is calibrated if, e.g., 95% of the true 
target values are below or equal to the predicted 95% quantile (so-called quantile-calibrated 
regression) [GBR07, KFE18, SDKF19]. A regression model is usually calibrated 
by fine-tuning its predicted CDF in a post-processing step to match the empirical 
frequency. Common approaches utilize isotonic regression [KFE18], logistic and 
beta calibration [SKF18], as well as Gaussian process models [SKF18, SDKF19] to 
build a calibration mapping. In contrast to quantile-calibrated regression, [SDKF19] 
have recently introduced the concept of distribution calibration, where calibration 
is applied on a distribution level and naturally leads to calibrated quantiles. Recent 
work has shown that miscalibration in the scope of object detection also depends on 
the position and scale of a detected object [KKSH20]. The additional box regression 
output is denoted by R̂ with A as the size of the used box encoding. Furthermore, 
if we have no knowledge about all anchors of a model (which is a common case in 
many applications), it is not possible to determine the accuracy. Therefore, Küppers 
et al. [KKSH20] use the precision as a surrogate for accuracy and propose that an 
object detection model is perfectly calibrated if 


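\[ \underbrace{P(M = 1 \mid \hat{P} = p, \hat{Y} = y, \hat{R} = r)}_{\text{precision given } p,\, y,\, r} = \underbrace{p}_{\text{confidence}} \quad \forall p \in [0, 1],\ y \in \mathcal{Y},\ r \in \mathcal{R}^{A} \tag{8} \]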


is fulfilled, where M = 1 denotes a correct prediction that matches a ground-truth 
object with a chosen IoU threshold and M = 0 denotes a mismatch, respectively. The 
authors propose the detection-expected calibration error (D-ECE) as the extension 
of the ECE to object detection tasks in order to measure miscalibration also by 
means of the position and scale of detected objects. Other approaches try to fine-tune 
the regression output in order to obtain more reliable object proposals [JLM+18, 
RTG+19] or to add a regularization term to the training objective such that training 
yields models that are both well performing and well calibrated [PTC+17, SSH19]. 

In Chapter “Confidence Calibration for Object Detection and Segmentation” 
[KHKS22], an extension of the ECE for general multivariate learning problems, such 
as object detection or image segmentation, will be discussed. In fact, the proposed 
multivariate confidence calibration is designed to take additional information into 
account, such as bounding boxes or shape descriptors. It is shown that this extended 
calibration not only reduces the calibration error, as expected, but also has a positive 
effect on the quality of segmentation masks. 


7 Aggregation 


From a high-level perspective, a neural network is based on processing inputs and 
coming to some output conclusion, e.g., mapping incoming image data onto class 
labels. Aggregation or collection of non-independent information on either the input 
or output side of this network function can be used as a tool to leverage its performance 
and reliability. Starting with the input, any additional “dimension” to add data can be 
of use. For example, in the context of automated vehicles, this might be input from 
any further sensor measuring the same scene as the original one, e.g., stereo cameras 
or LiDAR. Combining those sensor sets for prediction is commonly referred to as 
sensor fusion [CBSW19]. Staying with the example, the scene will be monitored 
consecutively providing a whole (temporally ordered) stream of input information. 
This may be used either by adjusting the network for this kind of input [KLX+17] or 
in terms of a post-processing step, in which the predictions are aggregated by some 
measure of temporal consistency. 

Another more implicit form of aggregation is training the neural network on 
several “independent” tasks, e.g., segmentation and depth regression. Although the 
individual task is executed on the same input, the overall performance can still benefit 
from the correlation among all given tasks. We refer to the discussion on multi-task 
networks in Sect. 9.2. By extension, solving the same task in multiple different ways 
can be beneficial for performance and provide a measure of redundancy. In this 
survey, we focus on single-task systems and discuss ensemble methods in the next 
section and the use of temporal consistency in the one thereafter. 


7.1 Ensemble Methods 


Training a neural network is optimizing its parameters to fit a given training dataset. 
The commonly used gradient-based optimization schemes cause convergence in a 
“nearby” local minimum. As the loss landscapes of neural networks are notoriously 
non-convex [CHM+15], various locally optimal model parameter sets exist. These 
local optima differ in the degree of optimality (“deepness”), qualitative characteristics 
(“optimal for different parts of the training data”), and their generalizability to unseen 
data (commonly referred to by the geometrical terms of “sharpness” and “flatness” 
of minima [KMN+17]). 

A single trained network corresponds to one local minimum of such a loss land- 
scape and thus captures only a small part of a potentially diverse set of solutions. 
Network ensembles are collections of models and therefore better suited to reflect 
this multi-modality. Various modeling choices shape a loss landscape: the selected 
model class and its meta-parameters (like architecture and layer width), the training 
data and the optimization objective. Accordingly, approaches to diversify ensemble 
components range from combinations of different model classes over varying train- 
ing data (bagging) to methods that train and weight ensemble components to make 
up for the flaws of other ensemble members (boosting) [Bis06]. 

Given the millions of parameters of application-size networks, ensembles of NNs 
are resource-demanding w.r.t. computational load, storage, and runtime during train- 
ing and inference. This complexity increases linearly with ensemble size for naïve 
ensembling. Several approaches were put forward to reduce some dimensions of this 
complexity: snapshot ensembles [HLP+17] require only one model optimization 
with a cyclical learning-rate schedule leading to an optimized training runtime. The 
resulting training trajectory passes through several local minima. The corresponding 
models compose the ensemble. In contrast, model distillation [HVD14] tackles 
runtime at inference. It “squeezes” an NN ensemble into a single model that is 
optimized to capture the gist of the model set. However, such a compression goes 
along with reduced performance compared to the original ensemble. 
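
The cyclical learning-rate schedule used by snapshot ensembles can be sketched as follows. The text only states that the schedule is cyclical; the cosine shape with warm restarts used here is an assumption, as are the function and parameter names. A snapshot of the model would be stored at the end of each cycle, when the learning rate is lowest.

```python
import math

def snapshot_lr(step, total_steps, n_cycles, lr_max):
    """Cosine-shaped cyclical learning rate (assumed form), restarting every cycle."""
    cycle_len = math.ceil(total_steps / n_cycles)
    pos = (step % cycle_len) / cycle_len        # position within the cycle, in [0, 1)
    return 0.5 * lr_max * (math.cos(math.pi * pos) + 1.0)

def snapshot_steps(total_steps, n_cycles):
    """Zero-based training steps after which a snapshot would be saved."""
    cycle_len = math.ceil(total_steps / n_cycles)
    return [min((c + 1) * cycle_len, total_steps) - 1 for c in range(n_cycles)]

lrs = [snapshot_lr(s, total_steps=300, n_cycles=3, lr_max=0.1) for s in range(300)]
saves = snapshot_steps(total_steps=300, n_cycles=3)
```

With these settings, the learning rate starts at `lr_max`, decays towards zero within 100 steps, and then jumps back to `lr_max`, which is intended to push the optimizer out of the current minimum towards the next one.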

Several hybrids of single model and model ensemble exist: multi-head networks 
[AJD18] share a backbone network that provides inputs to multiple prediction networks. 
Another variant is the mixture-of-experts model, which utilizes a gating network to 
assign inputs to specialized expert networks [SMM+17]. Multi-task networks (cf. 
Sect. 9.2) and Bayesian approximations of NNs (cf. Sects. 6.2 and 6.3) can be seen 
as implicit ensembles. 
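
The gating mechanism of a mixture-of-experts model can be sketched as below. This is a simplified, dense variant for illustration only: every expert is evaluated and weighted by a softmax gate, whereas sparse gating as in [SMM+17] would evaluate only a few selected experts. The linear experts and all weight names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_experts(x, gate_w, expert_ws):
    """Dense mixture of linear softmax experts (illustrative assumption).

    x: (batch, d), gate_w: (d, n_experts), expert_ws: list of (d, n_classes).
    """
    gate = softmax(x @ gate_w)                                          # (batch, n_experts)
    expert_out = np.stack([softmax(x @ w) for w in expert_ws], axis=1)  # (batch, n_experts, classes)
    mixed = (gate[..., None] * expert_out).sum(axis=1)                  # gate-weighted average
    return mixed, gate

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
gate_w = rng.normal(size=(4, 3))
expert_ws = [rng.normal(size=(4, 2)) for _ in range(3)]
y, gate = mixture_of_experts(x, gate_w, expert_ws)
```

The gate values indicate which expert is responsible for which input, which is one reason such architectures are of interest with respect to interpretability.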

NN ensembles (or deep ensembles) are not only used to boost model quality. 
They constitute the frequentist approach to estimating NN uncertainties and are state 
of the art in this regard [LPB17, SOF+19]. The emergent field of federated learning 
is concerned with the integration of decentrally trained ensemble components 
[MMR+17], and safety-relevant applications of ensembling range from automated 
driving [Zha12] to medical diagnostics [RRMH17]. Taking this safe-ML perspective, 
promising research directions comprise a more principled and efficient composition 
of model ensembles, e.g., by application-driven diversification, as well as improved 
techniques to miniaturize ensembles and a better understanding of methods 
like model distillation. In the long run, better designed, more powerful learning 
systems might partially reduce the need for combining weaker models in a network 
ensemble. 
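
How an ensemble yields an uncertainty estimate can be illustrated with a short sketch. It assumes that the softmax outputs of all ensemble members are already available; the decomposition into predictive entropy and member disagreement (mutual information) is one common choice and is not prescribed by the text.

```python
import numpy as np

def ensemble_uncertainty(member_probs, eps=1e-12):
    """Aggregate softmax outputs of M ensemble members.

    member_probs: (M, N, C) array. Returns the mean prediction (N, C),
    the entropy of the mean (N,), and the member disagreement (N,).
    """
    p = np.asarray(member_probs, dtype=float)
    mean = p.mean(axis=0)
    total = -(mean * np.log(mean + eps)).sum(axis=-1)             # predictive entropy
    expected = -(p * np.log(p + eps)).sum(axis=-1).mean(axis=0)   # mean entropy of members
    return mean, total, total - expected                          # difference = disagreement

member_probs = np.array([
    [[0.9, 0.1], [0.9, 0.1]],   # member 1
    [[0.9, 0.1], [0.1, 0.9]],   # member 2
    [[0.9, 0.1], [0.5, 0.5]],   # member 3
])
mean, total, disagreement = ensemble_uncertainty(member_probs)
```

For the first input, all members agree and the disagreement term vanishes; for the second input, the members contradict each other, which shows up as a clearly positive disagreement.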

In Chapter “Evaluating Mixture-of-Expert Architectures for Network Aggrega- 
tion” [PHW22], the advantages and drawbacks of mixture-of-experts architectures 
with respect to robustness, interpretability, and overall test performance will be dis- 
cussed. Two state-of-the-art architectures are investigated, obtaining, sometimes sur- 
passing, baseline performance. What is more, the models exhibit improved awareness 
for OoD data. 


7.2 Temporal Consistency 


The focus of previous DNN development for semantic segmentation has been on 
single-image prediction. This means that the final and intermediate results of the 
DNN are discarded after each image. However, the application of a computer vision 
model often involves the processing of images in a sequence, i.e., there is a temporal 
consistency in the image content between consecutive frames (for a metric, cf., 
e.g., [VBB+20]). This consistency has been exploited in previous work to increase 
quality and reduce computing effort. Furthermore, this approach offers the potential 
to improve the robustness of DNN prediction by incorporating this consistency as 
a priori knowledge into DNN development. The relevant work in the field of video 
prediction can be divided into two major approaches: 


1. DNNs are specially designed for video prediction. This usually requires training 
from scratch and the availability of training data in a sequence. 

2. A transformation from single prediction DNNs to video prediction DNNs takes 
place. Usually no training is required, i.e., the existing weights of the model can 
be used unaltered. 


The first set of approaches often involves conditional random fields (CRFs) and their 
variants. CRFs are known for their use as a post-processing step in the prediction of 
semantic segmentation, in which their parameters are learned separately or jointly 
with the DNN [ZJRP+15]. Another way to use spatiotemporal features is to include 
3D convolutions, which add an additional dimension to the conventional 2D convo- 
lutional layer. Tran et al. [TBF+15] use 3D convolution layers for video recognition 
tasks, such as action and object recognition. One further approach to use spatial and 
temporal characteristics of the input data is to integrate long short-term memory 
(LSTM) [HS97], a variant of the recurrent neural network (RNN). Fayyaz et al. 
[FSS+16] integrate LSTM layers between the encoder and decoder of their convo- 
lutional neural network for semantic segmentation. The significantly higher GPU 
memory requirements and computational effort are disadvantages of this method. 
More recently, Nilsson and Sminchisescu [NS18] deployed gated recurrent units, 
which generally require significantly less memory. An approach to improve temporal 
consistency of automatic speech recognition outputs is known as a 
posterior-in-posterior-out (PIPO) LSTM “sequence enhancer”, a postfilter which could be 
applicable to video processing as well [LSFs19]. A disadvantage of the described 
methods is that sequential data for training must be available, which may be limited 
or show a lack of diversity. 

The second class of approaches has the advantage that it is model independent 
most of the time. Shelhamer et al. [SRHD16] found that the deep feature maps 
within the network change only slightly with temporal changes in video content. 
Accordingly, [GJG17] calculate the optical flow of the input images from time steps 
t = 0 and t = −1 and convert it into the so-called transform flow which is used to 
transform the feature maps of the time step t = −1 so that an aligned representation 
to the feature map t = 0 is achieved. Sämann et al. [SAMG19] use a confidence- 
based combination of feature maps from previous time steps based on the calculated 
optical flow. 
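
The general idea of flow-based alignment followed by a confidence-based combination can be sketched as follows. This is a strongly simplified illustration and not the method of [GJG17] or [SAMG19]: it operates on per-pixel class probabilities instead of internal feature maps, assumes an integer-valued flow field given at the pixels of the current frame, and simply keeps, per pixel, the distribution with the higher maximum probability.

```python
import numpy as np

def warp_previous(prev, flow):
    """Align the map of time step t = -1 to t = 0.

    prev: (H, W, C) map at t = -1; flow: (H, W, 2) integer motion (dy, dx)
    from t = -1 to t = 0, given at the pixels of frame t = 0 (assumption).
    """
    flow = np.asarray(flow, dtype=int)
    H, W, _ = prev.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_i = np.clip(ii - flow[..., 0], 0, H - 1)   # where each pixel came from
    src_j = np.clip(jj - flow[..., 1], 0, W - 1)
    return prev[src_i, src_j]

def fuse(curr, prev, flow):
    """Per pixel, keep the more confident of the current and the warped prediction."""
    warped = warp_previous(prev, flow)
    use_prev = warped.max(axis=-1) > curr.max(axis=-1)
    return np.where(use_prev[..., None], warped, curr)

prev = np.array([[[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]])   # confident at column 0
curr = np.array([[[0.5, 0.5], [0.6, 0.4], [0.5, 0.5]]])   # weak evidence at column 1
flow = np.zeros((1, 3, 2), dtype=int)
flow[..., 1] = 1                                          # content moved one pixel to the right
fused = fuse(curr, prev, flow)
```

In the example, the confident prediction from the previous frame is carried along with the motion and replaces the less confident current prediction at column 1. Since no weights are changed, such a scheme can be put on top of an existing single-image model.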


8 Verification and Validation 


Verification and validation are an integral part of the safety assurance for any 
safety-critical system. According to the functional safety standard for automotive systems [ISO18], 
verification means to determine whether given requirements are met [ISO18, 3.180], 
such as performance goals. Validation, on the other hand, tries to assess whether the 
given requirements are sufficient and adequate to guarantee safety [ISO18, 3.148], 
e.g., whether certain types of failures or interactions simply were overlooked. The 
latter is usually achieved via extensive testing in real operation conditions of the 
integrated product. This differs from the notion of validation used in the machine 
learning community in which it usually refers to simple performance tests on a 
selected dataset. In this section, we want to concentrate on general verification aspects 
for deep neural networks. 

Verification as in the safety domain encompasses (manual) inspection and analysis
activities, and testing. However, the contribution of single processing steps within
a neural network to the final behavior can hardly be assessed manually (compare
the problem of interpretability in Sect. 5). Therefore, we concentrate here
on different approaches to verification testing, i.e., formal testing and black-box
methods.

The discussion on how to construct an assurance case and how to derive suitable
acceptance criteria for ML models in the scope of autonomous driving, as well as
how to determine and acquire sufficient evidence for such an assurance case, is
provided in Chapter “Safety Assurance of Machine Learning for Perception Functions”
[BHH+22]. The authors look into several use cases on how to systematically obtain
evidence and incorporate it into a structured safety argumentation.

In Chapter “A Variational Deep Synthesis Approach for Perception Validation” 
[GHS22], a novel data generation framework is introduced, aiming at validating per- 
ception functions, in particular, in the context of autonomous driving. The framework 
allows for effectively defining and testing common traffic scenes. Experiments are 
performed for pedestrian detection and segmentation in versatile urban scenarios. 


8.1 Formal Testing 


Formal testing refers to testing methods that include formalized and formally ver- 
ifiable steps, e.g., for test data acquisition, or for verification in the local vicinity 
of test samples. For image data, local testing around valid samples is usually more 
practical than fully formal verification: (safety) properties are not expected to hold 
on the complete input space but only on the much smaller unknown lower dimen- 
sional manifold of real images [WGH19]. Sources of such samples can be real ones 
or generated ones using an input space formalization or a trained generative model. 

Coverage criteria for the data samples are commonly used for two purposes: (a) 
deciding when to stop testing or (b) identifying missing tests. For CNNs, there are 
at least three different approaches toward coverage: (1) approaches that establish 
coverage based on a model with semantic features of the input space [GHHW20], 
(2) approaches trying to semantically cover the latent feature space of a neural
network or a proxy network (e.g., an autoencoder) [SS20a], and (3) approaches trying
to cover neurons and their interactions, inspired by classical software white-box 
analysis [PCYJ19, SHK+19]. 
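
As a minimal sketch of approach (3), neuron coverage in its simplest form can be computed as the fraction of neurons that are activated above a threshold by at least one test input. The function name, the flat per-sample activation layout, and the threshold value are assumptions made for illustration; the coverage models in [PCYJ19, SHK+19] may differ in detail.

```python
def neuron_coverage(activations, threshold=0.5):
    """Fraction of neurons activated above `threshold` by at least one test sample.

    `activations` holds one entry per test sample; each entry is a list of
    activation values, one per neuron (hypothetical flat layout).
    """
    if not activations:
        return 0.0
    num_neurons = len(activations[0])
    covered = [False] * num_neurons
    for sample in activations:
        for j, value in enumerate(sample):
            if value > threshold:
                covered[j] = True
    return sum(covered) / num_neurons


# Two test samples, three neurons: the third neuron is never activated.
test_activations = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.0]]
coverage = neuron_coverage(test_activations)
```

A coverage value below 1 points to neurons that no test input has exercised, which corresponds to purpose (b), identifying missing tests.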

Typical types of properties to verify are simple test performance, local stability 
(robustness), a specific structure of the latent spaces like embedding of semantic 
concepts [SS20b], and more complex logical constraints on inputs and outputs, which 
can be used for testing when fuzzified [RDG18]. Most of these properties require 
in-depth semantic information about the DNN inner workings, which is often only 
available via interpreting intermediate representations [KWG+18], or interpretable 
proxies/surrogates (cf. Sect. 5.4), which do not guarantee fidelity.

There exist different testing and formal verification methods from classical soft- 
ware engineering that have already been applied to CNNs. Differential testing as 
used by DeepXPlore [PCYJ19] trains n different CNNs for the same task using inde- 
pendent datasets and compares the individual prediction results on a test set. This
allows inconsistencies between the CNNs to be identified, but not weak spots they have
in common. Data augmentation techniques start from a given dataset and generate
additional transformed data. Generic data augmentation for images, such as rotation
and translation, is state of the art for training but may also be used for testing.
Concolic testing approaches incrementally grow test suites with respect to a coverage
model to finally achieve completeness. Sun et al. [SWR+18] use an adversarial input
model based on some norm (cf. Sect. 4.1), e.g., an L_p-norm, for generating additional
images around a given image using concolic testing. Fuzzing generates new test data
constrained by an input model and tries to identify interesting test cases, e.g., by
optimizing the white-box coverage mentioned above [OOAG19]. Fuzzing techniques may
also be combined with the differential testing approach discussed above [GJZ+18]. In
all these cases, it needs to be ensured that the image as well as its metadata remains
valid for testing after transformation. Finally, proving methods surveyed by Liu et al.
[LAL+20] try to formally prove properties of a trained neural network, e.g., based on
satisfiability modulo theories (SMT). These approaches require a formal
characterization of the input space and the property to be checked, which is hard for
non-trivial properties such as the contents of an image.
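
The idea behind differential testing can be sketched in a few lines: several models for the same task are queried with the same inputs, and every input on which they disagree is flagged for inspection. The toy threshold classifiers and the function name below are hypothetical stand-ins for independently trained CNNs and are not taken from DeepXplore.

```python
def find_disagreements(models, inputs):
    """Return (input, predictions) pairs on which the models do not all agree."""
    flagged = []
    for x in inputs:
        predictions = [model(x) for model in models]
        if len(set(predictions)) > 1:
            flagged.append((x, predictions))
    return flagged


# Three toy binary classifiers with slightly different decision thresholds.
models = [
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.6),
    lambda x: int(x > 0.4),
]
flagged = find_disagreements(models, [0.1, 0.45, 0.55, 0.9])
```

Only the inputs near the decision thresholds are flagged. An input that all models misclassify in the same way would pass unnoticed, which is the limitation regarding common weak spots.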

Existing formal testing approaches can be quite costly to integrate into testing
workflows: differential testing and data augmentation require several inferences per
initial test sample; concolic testing and fuzzing apply an optimization to each given
test sample, while convergence toward the coverage goals is not guaranteed; the
iterative approaches also need tight integration into the testing workflow; and lastly,
proving methods usually have to balance computational efficiency against the
precision or completeness of the result [LAL+20]. Another challenge of formal testing
is that machine learning applications usually solve problems for which no (formal)
specification is possible. This makes it hard to find useful requirements for testing
[ZHML20] and properties that can be formally verified. Even partial requirements,
such as specifications of useful input perturbations, specified corner cases, and
valuable coverage goals, are typically difficult to identify [SS20a, BTLFs20].


8.2 Black-Box Methods 


In machine learning literature, neural networks are often referred to as black boxes
because their internal operations and their decision-making are not completely
understood [SZT17], hinting at a lack of interpretability and transparency.
However, in this survey, we consider a black box to be a machine learning model to
which we only have oracle (query) access [TZJ+16, PMG+17]. That means we can
query the model to get input-output pairs, but we do not have access to the specific
architecture (or weights, in the case of neural networks). As [OAFS18] describes, black
boxes are increasingly widespread, e.g., in health care, autonomous driving, or ML as
a service in general, for proprietary, privacy, or security reasons. As deploying
black boxes gains popularity, so do methods that aim to extract internal information,
such as architecture and parameters, or to find out whether a sample belongs to the
training dataset. These include model extraction attacks [TZJ+16, KTP+20],
membership inference attacks [SSSS17], general attempts to reverse-engineer the model
[OAFS18], and adversarial attacks [PMG+17]. Protection and countermeasures
are also actively researched: Kesarwani et al. [KMAM18] propose a warning system
that estimates how much information an attacker could have gained from queries.
Adi et al. [ABC+18] use watermarks for models to prevent illegal redistribution and
to identify intellectual property.

Many papers in these fields make use of so-called surrogate, avatar, or proxy
models that are trained on input-output pairs of the black box. If the black-box
output is available in soft form (e.g., logits), distillation as first proposed by
[HVD14] can be applied to train the surrogate (student) model. Then, any white-box
analysis can be performed on the surrogates (cf., e.g., [PMG+17]) to craft adversarial
attacks targeted at the black box. More generally, (local) surrogates, as, for example,
in [RSG16], can be used to (locally) explain the decision-making of the black box.
Moreover, these techniques are also of interest if one wants to compare or test
black-box models (cf. Sect. 8.1). This is the case, among others, in ML marketplaces,
where one wishes to buy a pre-trained model [ABC+18], or if one wants to verify or
audit that a third-party black-box model obeys regulatory rules (cf. [CH19]).

Another topic of active research is so-called observers. The concept of observers 
is to evaluate the interface of a black-box module to determine if it behaves as 
expected within a given set of parameters. The approaches can be divided into 
model-explaining and anomaly-detecting observers. First, model explanation meth- 
ods answer the question of which input characteristic is responsible for changes at 
the output. The observer is able to alter the inputs for this purpose. If the input of the 
model under test evolves only slightly but the output changes drastically, this can be a 
signal that the neural network is misled, which is also strongly related to adversarial 
examples (cf. Sect. 4). Hence, knowing which part of the input causes a change in the
classification result can be very important. In order to figure out in which region of an input image
the main reason for the classification is located, [FV17] “delete” information from 
the image by replacing regions with generated patches until the output changes. This 
replaced region is likely responsible for the decision of the neural network. Building 
upon this, [UEKH19] adapt the approach to medical images and generate “deleted” 
regions by a variational autoencoder (VAE). Second, anomaly-detecting observers 
register input and output anomalies, either examining input and output independently
or as an input-output pair, and predict the black-box performance in the current
situation. In contrast to model-explaining approaches, this set of approaches has high
potential to be used in an online scenario since it does not need to modify the model 
input. The maximum mean discrepancy (MMD) [BGR+06] measures the domain gap 
between two data distributions independently from the application and can be used to 
raise a warning if input or output distributions during inference deviate too strongly 
from their respective training distributions. Another approach is using a GAN-based 
autoencoder [LFK+20] to perform a domain shift estimation where the Wasserstein 
distance is used as domain mismatch metric. This metric can also be evaluated by
use of a causal time-variant aggregation of distributions during inference time.
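
To illustrate how an anomaly-detecting observer could use the MMD, the following sketch computes a biased estimate of the squared MMD with a Gaussian kernel and compares a reference sample against an in-domain and a shifted sample. The kernel choice, the bandwidth `sigma`, the sample sizes, and all function names are assumptions of this sketch and are not prescribed by [BGR+06].

```python
import math
import random


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between the sample sets xs and ys."""
    k_xx = sum(rbf_kernel(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf_kernel(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


rng = random.Random(0)
train = [[rng.gauss(0.0, 1.0)] for _ in range(100)]      # reference (training) data
in_domain = [[rng.gauss(0.0, 1.0)] for _ in range(100)]  # same distribution
shifted = [[rng.gauss(3.0, 1.0)] for _ in range(100)]    # shifted distribution

score_in_domain = mmd_squared(train, in_domain)
score_shifted = mmd_squared(train, shifted)
```

An observer could raise a warning whenever the score exceeds a threshold calibrated on held-out in-domain data; how such a threshold is chosen is left open here.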


9 Architecture


In order to solve a specific task, the architecture of a CNN and its building blocks
play a significant role. Since the early days of using CNNs in image processing, when
they were applied to handwriting recognition [LBD+89], and the later breakthrough in
general image classification [KSH12], the architecture of the networks has changed
radically. While the term deep learning implied a depth of approximately four layers
for these first convolutional neural networks, their depth has increased significantly
in recent years, and new techniques had to be developed to successfully train and
utilize these networks [HZRS16]. In this context, new activation functions [RZL18]
as well as new loss functions [LGG+17] have been designed, and new optimization
algorithms [KB15] have been investigated.

With regard to the layer architecture, the initially alternating repetition of
convolution and pooling layers as well as their characteristics have changed
significantly. The convolution layers made the transition from a few layers with often
large filters to many layers with small filters. A further trend was then the
definition of entire modules, which are used repeatedly within the overall
architecture, following the so-called network-in-network principle [LCY14].

In areas such as automated driving, there is also a strong interest in the simulta- 
neous execution of different tasks within one single convolutional neural network 
architecture. This kind of architecture is called multi-task learning (MTL) [Car97] 
and can be utilized in order to save computational resources and at the same time 
to increase performance of each task [KTMFs20]. Within such multi-task networks, 
usually one shared feature extraction part is followed by one separate so-called head 
per task [TWZ+18]. 

In each of these architectures, manual design using expert knowledge plays a 
major role. The role of the expert is the crucial point here. In recent years, however, 
there have also been great efforts to automate the process of finding architectures for 
networks or, in the best case, to learn them. This is known under the name neural 
architecture search (NAS). 


9.1 Building Blocks 


Designing a convolutional neural network typically includes a number of design 
choices. The general architecture usually contains a number of convolutional and 
pooling layers which are arranged in a certain pattern. Convolutional layers are 
commonly followed by a nonlinear activation function. The learning process is based
on a loss function, which determines the current error, and an optimization algorithm
that propagates the error back to the individual convolutional layers and their
learnable parameters.

When CNNs became state of the art in computer vision [KSH12], they were
usually built using a few alternating convolutional and pooling layers followed by a
few fully connected layers at the end. It turned out that better results are achieved
with deeper networks, and so the number of layers increased [SZ15] over the years. To
deal with these deeper networks, new architectures had to be developed. In a first
step, to reduce the number of parameters, the convolutional layers with partly large
filter kernels were replaced by several layers with small 3 × 3 kernels. Today, most
architectures are based on the network-in-network principle [LCY14], where more
complex modules are used repeatedly. Examples of such modules are the inception
module from GoogLeNet [SLJ+15] or the residual block from ResNet [HZRS16].
While the inception module consists of multiple parallel strings of layers, the residual
blocks are based on the highway network [SGS15], which means that the original
information can bypass the layers in between, which then only learn residuals. With
ResNeXt [XGD+17] and Inception-ResNet [SIVA17], there already exist two
networks that combine both approaches. For most tasks, it turned out that replacing
the fully connected layers by convolutional layers is much more convenient, making
the networks fully convolutional [LSD15]. These so-called fully convolutional
networks (FCNs) are no longer bound to fixed input dimensions. Note that with the
availability of convolutional long short-term memory (ConvLSTM) structures,
fully convolutional recurrent neural networks (FCRNs) also became available for fully
scalable sequence-based tasks [SCW+15, SDF+20].

Inside the CNNs, the rectified linear unit (ReLU) has been the most frequently
used activation function for a long time. However, since this function suffers from
problems related to the mapping of all negative values to zero, such as the vanishing
gradient problem, new functions have been introduced in recent years. Examples are
the exponential linear unit (ELU), swish [RZL18], and the non-parametric linearly
scaled hyperbolic tangent (LiSHT) [RMDC19]. In order to be able to train a network
consisting of these different building blocks, the loss function is the most crucial
part. This function is responsible for how and what the network ultimately learns
and how exactly the training data is applied during the training process to make
the network train faster or perform better. For example, the different classes in a
classification network can be weighted with fixed values or by so-called α-balancing
according to their probability of occurrence. Another interesting approach is weighting
training examples according to their easiness for the current network [LGG+17, WFZ19].
For multi-task learning, tasks can also be weighted based on their uncertainty [KGC18]
or gradients [CBLR18], as further explained in Sect. 9.2. A closer look at
how a modification of the loss function might affect safety-related aspects is given
in Sect. 3, Modification of Loss.
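
The two weighting ideas, a class-dependent factor and a down-weighting of easy examples, can be combined in a single per-sample loss in the style of the focal loss. The sketch below is a simplified single-sample version; the function name, the use of one `alpha` value for the true class, and the default values are illustrative assumptions.

```python
import math


def alpha_balanced_focal_loss(p_true, alpha=0.25, gamma=2.0):
    """Loss for one sample, given the predicted probability of its true class.

    alpha: class-dependent weight (here a single value for the true class).
    gamma: focusing parameter; gamma = 0 disables the easiness weighting.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


easy = alpha_balanced_focal_loss(0.9)  # confidently correct prediction
hard = alpha_balanced_focal_loss(0.1)  # poorly predicted sample
```

With `alpha=1.0` and `gamma=0.0`, the expression reduces to the ordinary cross-entropy of the true class, so both weightings can be read as multiplicative modifications of the standard loss.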


9.2 Multi-task Networks 


Multi-task learning (MTL) in the context of neural networks describes the process of
optimizing several tasks simultaneously by learning a unified feature representation
[Car97, RBV18, GHL+20, KBFs20] and coupling the task-specific loss contributions,
thereby enforcing cross-task consistency [CPMA19, LYW+19, KTMFs20].
A unified feature representation is usually implemented by sharing the parameters
of the initial layers inside the encoder (also called feature extractor). It not only
improves the single tasks through more generalized learned features but also reduces
the demand for computational resources at inference: instead of an entirely new
network, only a task-specific decoder head has to be added for each task. This is
essential considering the growing number of visual perception tasks in automated
driving, e.g., depth estimation, semantic segmentation, motion segmentation, and
object detection. While the parameter sharing can be soft, as in cross-stitch [MSGH16]
and sluice networks [RBAS18], or hard [Kok17, TWZ+18], meaning that the parameters
are actually shared, the latter is usually preferred due to its straightforward
implementation and lower computational complexity during training and inference.

Compared to implicitly coupling tasks via a shared feature representation, there
are often more direct ways to optimize the tasks jointly through cross-task losses.
This is only possible because, during MTL, there are network predictions for several
tasks, which can be enforced to be consistent. As an example, sharp depth edges
should only occur at class boundaries of semantic segmentation predictions. Often both
approaches to MTL are applied simultaneously [YZS+18, CLLW19] to improve a
neural network’s performance as well as to reduce its computational complexity at
inference.

While the theoretical expectations for MTL are quite clear, it is often challenging
to find a good weighting strategy for all the different loss contributions, as there is
no theoretical basis on which one could choose such a weighting. Early approaches
involved either heuristics or extensive hyperparameter tuning. The easiest way to
balance the tasks is to use a uniform weight across all tasks. However, the losses
from different tasks usually have different scales, and uniformly averaging them
suppresses the gradient from tasks with smaller losses. Addressing these problems,
Kendall et al. [KGC18] propose to weigh the loss functions by the homoscedastic
uncertainty of each task. The weighting parameters of the loss functions then do not
need to be tuned by hand but are adapted automatically during the training
process. Concurrently, Chen et al. [CBLR18] propose GradNorm, which does not
explicitly weigh the loss functions of different tasks but automatically adapts the
gradient magnitudes coming from the task-specific network parts on the backward
pass. Liu et al. [LJD19] propose dynamic weight average (DWA), which uses an
average of task losses over time to weigh the task losses.
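
A minimal sketch of uncertainty-based weighting is given below. It uses one commonly seen form of the objective, in which each task loss is scaled by the inverse of a learnable variance and a log-variance term acts as a regularizer. The parameterization via `s = log(sigma^2)` and the function name are assumptions of this sketch; in practice, the log-variances would be trained jointly with the network weights.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine task losses, each weighted via a log-variance s = log(sigma^2)."""
    return sum(
        0.5 * math.exp(-s) * loss + 0.5 * s
        for loss, s in zip(task_losses, log_variances)
    )


# For a fixed task loss L, the term is minimized at s = log(L), so a task with
# a larger loss ends up with a larger variance and thus a smaller weight.
loss_value = 4.0
at_optimum = uncertainty_weighted_loss([loss_value], [math.log(loss_value)])
at_zero = uncertainty_weighted_loss([loss_value], [0.0])
```

The regularizing term `0.5 * s` prevents the trivial solution of driving all variances to infinity, which would otherwise set every task weight to zero.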


9.3 Neural Architecture Search 


In the previous sections, we saw manually engineered modifications of existing CNN
architectures such as those proposed by ResNet [HZRS16] or Inception [SLJ+15]. They
are results of human design and have shown their ability to improve performance. ResNet
introduces a skip connection in building blocks, and Inception makes use of
its specific inception module. Here, the intervention by an expert is crucial. The
approach of neural architecture search (NAS) aims to automate this time-consuming
and manual design of neural network architectures.

NAS is closely related to hyperparameter optimization (HO), which is described
in Sect. 3, Hyperparameter Optimization. Originally, both tasks were solved
simultaneously. Consequently, the kernel size or number of filters were seen as
additional hyperparameters. Nowadays, the distinction between HO and NAS should be
stressed: the concatenation of complex building blocks or modules cannot be accurately
described with single parameters, so this simplification is no longer suitable.

To describe the NAS process, the authors of [EMH19b] define three steps: (1) 
definition of search space, (2) search strategy, and (3) performance estimation strat- 
egy. 

The majority of search strategies take advantage of the NASNet search space [ZVSL18],
which arranges various operations, e.g., convolution and pooling, within a single cell.
However, other spaces based on a chain or multi-branch structure are possible
[EMH19b]. The search strategy comprises advanced methods from sequential
model-based optimization (SMBO) [LZN+18], Bayesian optimization [KNS+18],
evolutionary algorithms [RAHL19, EMH19a], reinforcement learning [ZVSL18,
PGZ+18], and gradient descent [LSY19, SDW+19]. Finally, the performance
estimation describes approximation techniques that are needed because multiple full
evaluation runs are impractical. For a comprehensive survey regarding the NAS process,
we refer to [EMH19b]. Recent research has shown that reinforcement learning approaches
such as NASNet-A [ZVSL18] and ENAS [PGZ+18] are partly outperformed by
evolutionary algorithms, e.g., AmoebaNet [RAHL19], and gradient-based approaches,
e.g., DARTS [LSY19].

Each of these approaches focuses on different optimization aspects. Gradient-based
methods are applied to a continuous search space and offer faster optimization.
In contrast, the evolutionary approach LEMONADE [EMH19a] enables multi-objective
optimization by considering the conjunction of resource consumption and
performance as the two main objectives. Furthermore, single-path NAS [SDW+19]
extends the multi-path approach of former gradient-based methods and proposes
the integration of “over-parameterized superkernels”, which significantly reduces
memory consumption.

The focus of NAS is on the optimized combination of CNN elements predefined by humans
with respect to objectives such as resource consumption and performance.
NAS offers automation; however, the realization of the objectives is strongly limited
by the potential of the CNN elements.


10 Model Compression 


Recent developments in CNNs have resulted in neural networks being the state of the
art in computer vision tasks like image classification [KSH12, HZRS15, MGR+18],
object detection [Gir15, RDGF15, HGDG17], and semantic segmentation [CPSA17,
ZSR+19, LBS+19, WSC+20]. This is largely due to the increasing availability of
hardware computational power and an increasing amount of training data. We also
observe a general upward trend in the complexity of the neural networks along with
their improvement in state-of-the-art performance. These CNNs are largely trained
on back-end servers with significantly higher computing capabilities. The use of
these CNNs in real-time applications is inhibited by the restrictions on hardware,
model size, inference time, and energy consumption. This led to the emergence of
a new field in machine learning, commonly termed model compression. Model
compression basically implies reducing the memory requirements, inference times,
and model size of DNNs to eventually enable the use of neural networks on edge
devices. This is tackled by different approaches, such as network pruning (identifying
and removing weights or filters that are not critical for network performance), weight
quantization (reducing the precision of the weights used in the network), knowledge
distillation (training a smaller network with the knowledge gained by a bigger
network), and low-rank factorization (decomposing a tensor into multiple smaller
tensors). In this section, we introduce some of these methods for model compression
and briefly discuss the current open challenges and possible research directions with
respect to their use in automated driving applications.

In Chapter “Joint Optimization for DNN Model Compression and Corruption 
Robustness” [VHB+22], the concept of both pruning and quantization on the appli- 
cation of semantic segmentation in autonomous driving is elucidated, while at the 
same time robustifying the model against corruptions in the input data. An improved 
performance and robustness for a state-of-the-art model for semantic segmentation 
is demonstrated. 


10.1 Pruning 


Pruning has been used as a systematic tool to reduce the complexity of deep neural
networks. The redundancy in DNNs may exist on various levels, such as the individual
weights, filters, and even layers. All the different methods for pruning try to
take advantage of these available redundancies on various levels. Two of the initial
approaches for neural networks proposed weight pruning in the 1990s as a way of
systematically damaging neural networks [CDS90, Ree93]. As these weight pruning
approaches do not aim at changing the structure of the neural network, they
are called unstructured pruning. Although the size of the network is reduced
when it is saved in a sparse format, the acceleration depends on the
availability of hardware that facilitates sparse multiplications. As pruning filters and
complete layers aims at exploiting the available redundancy in the architecture or
structure of neural networks, these pruning approaches are called structured pruning.
Pruning approaches can also be broadly classified into data-dependent and
data-independent methods. Data-dependent methods [LLS+17, LWL17, HZS17]
make use of the training data to identify filters to prune. Theis et al. [TKTH18] and
Molchanov et al. [MTK+17] propose a greedy pruning strategy that identifies the
importance of feature maps one at a time and measures the effect of removing the
corresponding filters on the training loss. This means that filters corresponding to
those feature maps that have the least effect on the training loss are removed from
the network. Within data-independent methods [LKD+17, HKD+18, YLLW18, ZQF+18],
the selection of CNN filters to be pruned is based on the statistics of the filter values.
Li et al. [LKD+17] proposed a straightforward method to calculate the rank of filters
in a CNN. The selection of filters is based on the L1-norm, where the filter with the
lowest norm is pruned away. He et al. [HZS17] employ a least absolute shrinkage
and selection operator (LASSO) regression-based selection of filters to minimize the
least squares reconstruction error.
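
A data-independent, norm-based filter selection can be sketched as follows, assuming the L1-norm (sum of absolute weights) as the ranking criterion. Filters are represented as flat lists of weights; the function name and the `prune_ratio` parameter are hypothetical and serve only to illustrate the ranking step, not the subsequent fine-tuning.

```python
def select_filters_to_prune(filters, prune_ratio):
    """Return the indices of the filters with the smallest L1-norm.

    filters: list of filters, each given as a flat list of weights.
    prune_ratio: fraction of filters to remove (between 0 and 1).
    """
    norms = [sum(abs(w) for w in f) for f in filters]
    num_to_prune = int(len(filters) * prune_ratio)
    ranked = sorted(range(len(filters)), key=lambda i: norms[i])
    return sorted(ranked[:num_to_prune])


filters = [
    [0.5, -0.5],   # L1-norm 1.0
    [0.01, 0.02],  # L1-norm 0.03
    [1.0, -2.0],   # L1-norm 3.0
    [0.1, 0.0],    # L1-norm 0.1
]
pruned = select_filters_to_prune(filters, prune_ratio=0.5)
```

Since the ranking depends only on the weights, no training data is needed for the selection itself.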

Although the abovementioned approaches demonstrated that a neural network 
can be compressed without affecting the accuracy, the effect on robustness is largely 
unstudied. Dhillon et al. [DAL+18] proposed pruning a subset of activations and 
scaling up the survivors to show improved adversarial robustness of a network. Lin 
et al. [LGH19] quantize the precision of the weights after controlling the Lipschitz 
constant of layers. This restricts the error propagation property of adversarial per- 
turbations within the neural network. Ye et al. [YXL+19] evaluated the relationship 
between adversarial robustness and model compression in detail and show that naive 
compression has a negative effect on robustness. Gui et al. [GWY+19] co-optimize
robustness and compression constraints during the training phase and demonstrate 
improvement in the robustness along with reduction in the model size. However, 
these approaches have mostly been tested on image classification tasks and on smaller 
datasets only. Their effectiveness on safety-relevant automated driving tasks, such 
as object detection and semantic segmentation, is not studied and remains an open 
research challenge. 


10.2 Quantization 


Quantization of a random variable x having a probability density function p(x) is the
process of dividing the range of x into intervals, each of which is represented by a
single value (also called reconstruction value), such that the following reconstruction
error is minimized:

\sum_{i=1}^{I} \int_{b_i}^{b_{i+1}} (q_i - x)^2 \, p(x) \, dx,    (9)

where b_i is the left-side border of the i-th interval, q_i is its reconstruction value, and
I is the number of intervals, e.g., I = 8 for a 3-bit quantization. This definition can
be extended to multiple dimensions as well.

Quantization of neural networks has been around since the 1990s [Guo18], albeit
with an early focus on improving the hardware implementations of
these networks. In the deep learning literature, a remarkable application of
quantization combined with unstructured pruning can be found in the approach of deep
compression [HMD16], where one-dimensional k-means is utilized to cluster the
weights per layer, thus finding the I cluster centers (q_i values in (9)) iteratively.
This procedure conforms to an implicit assumption that p(x) has the same spread
inside all clusters. Deep compression can reduce the network size needed for image
classification by a factor of 35 for AlexNet and a factor of 49 for VGG-16 without
any loss in accuracy. However, as pointed out in [JKC+18], these networks from
the early deep learning days are over-parameterized, and a less impressive
compression factor is thus expected when the same technique is applied to lightweight
architectures, such as MobileNet and SqueezeNet. For instance, considering
SqueezeNet (50 times smaller than AlexNet), the compression factor of deep
compression without accuracy loss drops to about 10.
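
The clustering step can be illustrated with a plain one-dimensional k-means over the weights of one layer. This is a simplified sketch: the evenly spaced initialization of the centers, the fixed number of iterations, and the function name are assumptions, and the pruning and fine-tuning stages of deep compression are omitted.

```python
def kmeans_1d(weights, num_clusters, iterations=20):
    """Cluster scalar weights; return the cluster centers and one code per weight."""
    lo, hi = min(weights), max(weights)
    # Evenly spaced initial centers over the weight range (assumption of this sketch).
    centers = [lo + (hi - lo) * (i + 0.5) / num_clusters for i in range(num_clusters)]

    def nearest(w):
        return min(range(num_clusters), key=lambda i: (w - centers[i]) ** 2)

    for _ in range(iterations):
        buckets = [[] for _ in range(num_clusters)]
        for w in weights:
            buckets[nearest(w)].append(w)
        # Each center moves to the mean of its assigned weights; empty clusters stay put.
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]

    codes = [nearest(w) for w in weights]
    return centers, codes


weights = [-1.0, -0.9, -1.1, 0.0, 0.1, -0.1, 1.0, 0.9, 1.1]
centers, codes = kmeans_1d(weights, num_clusters=3)
restored = [centers[c] for c in codes]  # dequantized weights
```

Only the small codebook `centers` and the per-weight indices `codes` need to be stored, which is where the reduction in model size comes from.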

Compared to the scalar quantization used in deep compression, there have been attempts
to exploit the structural information by applying variants of vector quantization of the
weights [GLYB14, CEL20, SJG+20]. Remarkably, in the latter (i.e., [SJG+20]), the
reconstruction error of the activations (instead of the weights) is minimized in order
to find an optimal codebook for the weights, as the ultimate goal of quantization is
to approximate the network’s output, not the network itself. This is performed in a
layer-by-layer fashion (so as to prevent error accumulation) using activations generated
from unlabeled data.

Other techniques [MDSN17, JKC+18] apply variants of so-called “linear” quan- 
tization, i.e., the quantization staircase has a fixed interval size. This paradigm 
conforms to an implicit assumption that p(x) in (9) is uniform and is thus also 
called uniform quantization. Uniform quantization is widely applied both in specialized
software packages, such as the Texas Instruments Deep Learning
Library (automotive boards) [MDS+18], and in general-purpose libraries, such as
TensorFlow Lite. The linearity assumption enables practical implementations,
as the quantization and dequantization can be implemented using a scaling factor
and an intercept, whereas no codebook needs to be stored. In many situations, the
intercept can be omitted by employing a symmetric quantization mapping. Moreover,
for power-of-2 ranges, the scaling ends up being a bit-wise shift operator, where
quantization and dequantization differ only in the shift direction. It is also straightforward
to apply this scheme dynamically, i.e., for each tensor separately using a
tensor-specific multiplicative factor. This can be easily applied not only to filters
(weight tensors) but also to activation tensors (see, for instance, [MDSN17]).
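
The scheme described above can be illustrated with a minimal sketch. It is not taken from any of the cited libraries; the function names, the 8-bit setting, and the example weights are assumptions chosen for illustration. It shows symmetric uniform quantization with a single tensor-specific scale factor (no intercept, no codebook) and how the scale can be rounded up to a power of two.

```python
import math


def quantize_symmetric(values, num_bits=8):
    """Uniform symmetric quantization with one scale factor per tensor."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g., 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and saturate at the range limits.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def power_of_two_scale(values, num_bits=8):
    """Round the scale up to a power of two, so scaling becomes a bit shift."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(max_abs / qmax))


# Hypothetical weight tensor, flattened to a list.
weights = [0.42, -1.27, 0.03, 0.88, -0.51]
codes, scale = quantize_symmetric(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Because the scale is derived from the largest magnitude in the tensor, the rounding error of every entry stays within half a quantization step.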

Unless the scale factor in the linear quantization is assumed constant by con- 
struction, it is computed based on the statistics of the relevant tensor and can be 
thus sensitive to outliers. This is known to result in a low-precision quantization. In 
order to mitigate this issue, the original range can be clipped and thus reduced to 
the most relevant part of the signal. Several approaches are proposed in the literature 
for finding an optimal clipping threshold: simple percentile analysis of the origi- 
nal range (e.g., clipping 2% of the largest magnitude values), minimizing the mean 
squared error between the quantized and original range in the spirit of (9) [BNS19], 
or minimizing the Kullback-Leibler divergence between the original and the quantized
distributions [Mig17]. While the clipping methods trade off large quantization
errors of outliers against small errors of inliers [WJZ+20], other methods tackle the
outlier problem using a different trade-off; see, for instance, the outlier channel
splitting approach in [ZHD+19].
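
The trade-off behind clipping can be made concrete with a small experiment. The sketch below is an assumption-laden toy, not one of the cited methods: the synthetic tensor, the 4-bit setting, and the helper names are hypothetical. It compares the mean squared quantization error for the full range against a range clipped by simple percentile analysis (dropping roughly 2% of the largest magnitudes).

```python
import random


def quantization_mse(values, clip, num_bits=4):
    """Mean squared error of symmetric uniform quantization on [-clip, clip]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = clip / qmax
    total = 0.0
    for v in values:
        code = max(-qmax, min(qmax, round(v / scale)))   # saturate outliers
        total += (v - code * scale) ** 2
    return total / len(values)


def percentile_threshold(values, keep_fraction=0.98):
    """Magnitude below which roughly `keep_fraction` of the values lie."""
    magnitudes = sorted(abs(v) for v in values)
    index = min(len(magnitudes) - 1, int(keep_fraction * len(magnitudes)))
    return magnitudes[index]


random.seed(0)
# Hypothetical tensor: mostly small values plus a few moderate outliers.
tensor = [random.gauss(0.0, 1.0) for _ in range(2000)] + [8.0, -7.0, 10.0]

full_range = max(abs(v) for v in tensor)
clipped_range = percentile_threshold(tensor, keep_fraction=0.98)

mse_full = quantization_mse(tensor, full_range)
mse_clipped = quantization_mse(tensor, clipped_range)
```

In this synthetic setting the clipped range yields the lower error, because the finer step size for the inliers outweighs the saturation error of the few outliers. With more extreme outliers the comparison can flip, which is why the threshold is optimized in the approaches cited above.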

An essential point to consider when deciding on a quantization approach for a
given problem is the allowed or intended interaction with the training procedure.
The so-called post-training quantization, i.e., quantization of a pre-trained network, 
seems to be attractive from a practical point of view: no access to training data is 
required and the quantization and training toolsets can be independent from each 
other. On the other hand, the training-aware quantization methods often yield higher 
inference accuracy and shorter training times. The latter is a serious issue for large 
complicated models which may need weeks to train on modern GPU clusters. The 
training-aware quantization can be implemented by inserting fake quantization opera- 
tors in the computational graph of the forward pass during training (simulated quan- 
tization), whereas the backward pass is done as usual in floating-point resolution 
[JKC+18]. Other approaches [ZWN+18, ZGY+19] go a step further by quantizing 
the gradients as well. This leads to much lower training time, as the time of the often 
computationally expensive backward pass is reduced. The gradient’s quantization, 
however, is not directly applicable as it requires the derivative of the quantization 
function (staircase-like), which is zero almost everywhere. Luckily, this issue can 
be handled by employing a straight-through estimator [BLC13] (approximating the 
quantization function by an identity mapping). There are also other techniques pro- 
posed recently to mitigate this problem [UMY+20, LM19]. 
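
As a minimal sketch of simulated quantization with a straight-through estimator, consider the toy problem below. Everything in it (the single weight, the step size of 0.25, the learning rate, and the function names) is a hypothetical illustration and not the procedure of [JKC+18] or [BLC13]. The forward pass uses the quantized weight, while the backward pass treats the quantizer as the identity mapping.

```python
def fake_quantize(w, scale=0.25, qmax=7):
    """Forward pass: quantize and immediately dequantize (simulated quantization)."""
    code = max(-qmax, min(qmax, round(w / scale)))
    return code * scale


def ste_gradient(upstream_grad):
    """Backward pass: approximate the staircase function by the identity."""
    return upstream_grad


# Toy problem (assumed): fit one weight so that fake_quantize(w) * x matches y.
x, y = 2.0, 1.5            # the ideal quantized weight is 0.75
w = 0.1                    # full-precision weight kept during training
learning_rate = 0.05

for _ in range(200):
    w_q = fake_quantize(w)
    error = w_q * x - y
    grad_wq = 2.0 * error * x           # d(loss)/d(w_q) for loss = error ** 2
    grad_w = ste_gradient(grad_wq)      # the true derivative is zero almost everywhere
    w -= learning_rate * grad_w

final_quantized_weight = fake_quantize(w)
```

Without the straight-through approximation, the gradient with respect to `w` would vanish and the weight would never leave its initial quantization bin.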


11 Discussion 


We have presented an extensive overview of approaches to effectively handle safety
concerns accompanying deep learning: lack of generalization, robustness, explainability,
plausibility, and efficiency. We have described which lines of research
we deem prevalent, important, and promising for each of the individual topics and
categories into which the presented methods fall.

The reviewed methods alone will not provide safe-ML systems as such and neither 
will their future extensions. This is due to the limitations of quantifying complex real- 
world contexts. A complete and plausible safety argumentation will, thus, require 
more than advances in methodology and theoretical understanding of neural network 
properties and training processes. Apart from methodological progress, it will be 
necessary to gain practical experience in using the presented methods to gather 
evidence for overall secure behavior, using this evidence to construct a tight safety 
argument, and testing its validity in various situations. 

In particular, each autonomously acting robotic system with state-of-the-art deep- 
learning-based perception and non-negligible actuation may serve as an object of 
study and is, in fact, in need of this kind of systematic reasoning before being transferred
to widespread use or even market entry. We strongly believe that novel scientific
insights, the potential market volume, and public interest will drive the arrival
of reliable and trustworthy AI technology.

Abstract While automated driving is often advertised with better-than-human driving
performance, this chapter shows that it is nearly impossible to provide direct
statistical evidence on the system level that this is actually the case. The amount of
labeled data needed would exceed present-day technical and economic
capabilities. A commonly used strategy therefore is the use of redundancy along
with the proof of sufficient subsystems’ performances. As is known, this strategy
is efficient especially for the case of subsystems operating independently, i.e., the
occurrence of errors is independent in a statistical sense. Here, we give some first
considerations and experimental evidence that this strategy is not a free ride, as the
errors of neural networks fulfilling the same computer vision task, at least in some
cases, show correlated occurrences of errors. This remains true if training data, architecture,
and training are kept separate or independence is trained using special loss
functions. Using data from different sensors (realized by up to five 2D projections of
the 3D MNIST dataset) in our experiments reduces correlations more efficiently,
however not to an extent that realizes the potential reduction of testing data
that can be obtained for redundant and statistically independent subsystems.


1 Introduction


The final report of the ethics committee on automated and connected driving
[FBB+17] at the German Federal Ministry of Transport and Digital Infrastructure
starts with the sentences¹ “Partially and fully automated traffic systems serve
first and foremost to improve the safety of all road users. [...] Protecting people takes
precedence over all other utilitarian considerations. The goal is to reduce harm up
to complete prevention. The approval of automated systems is only justifiable if, in
comparison with human driving performance, they promise at least a reduction of
damage in the sense of a positive risk balance”. This pronounced statement sets the
highest safety goals. In this chapter, we contemplate the feasibility of a justification
based on direct empirical evidence.

When it comes to automated driving, the elephant in the room is the outrageous amount
of data that is needed to empirically support the safety requirement set up by the ethics
committee with direct measurement. This chapter, however, is not the elephant’s first
sighting; see, e.g., [KP16], where it is shown that hundreds of millions to billions
of test kilometers are required for statistically valid evidence of better-than-human
driving performance by automated vehicles. While in this chapter the basic statistical
facts on the measurement of the probability of rare events are revisited and adapted
to a German context, we slightly extend the findings by estimating the data required
for testing AI-based perception functionality using optical sensors along with an
estimate of the labeling cost for a sufficient test database.

What is new in this chapter is a statistical discussion and preliminary experimental 
evidence on redundancy as a potential solution to the aforementioned problem. The 
decomposition of the system into redundant subsystems, each capable of triggering
the detection of other road users without fusion or filtering, largely reduces the amount
of data needed to test each subsystem. However, this is only true if the failure of the 
subsystems is statistically independent of the other subsystems. This leads to the 
question of (a) how to measure independence and (b) whether the actual behavior of 
neural networks supports the independence assumption. 

A study on the role of independence in ensembles of deep neural networks was 
presented in [LWC+19], where the goal was rather (1) to improve performance 
by selecting ensemble members according to different diversity scores and (2) to 
obtain robustness against adversarial attacks. In [WLX+20], a number of different 
consensus algorithms, i.e., ensemble voting algorithms, are compared according to 
different evaluation metrics. Also in that work, networks are trained independently 
and selected afterwards. 

In our own studies of the independence of error events in the predictions
of neural networks, we provide experiments for classification with deep neural
networks on the academic datasets EMNIST [CATvS17], CIFAR10 [Kri09], and
3D-MNIST.² We consider networks with an increasing degree of diversity with respect to
training data, architecture, and weight initialization. Even the most diverse networks
exhibit a Pearson correlation close to 0.60, which clearly contradicts the hypothesis
of independence. Also, re-training committees of up to 5 networks with special loss
functions to increase independence between committee members falls far short of
the k-out-of-n performance predicted for independent subsystems [Zac92,
ME14]. While it is possible to bring the mean correlation down to zero by special
loss functions in the training, this, at least in our preliminary experiments, at the same
time deteriorates the performance of the committee members. As the main take-away,
redundancy does not necessarily provide a solution to the testing problem.

1 Translated from German.
2 https://www.kaggle.com/daavoo/3d-mnist.

In this chapter, we do not aim at presenting final results, but only to provide a 
contribution to an inevitable debate. 

Our chapter is organized as follows: In the next section, we evaluate some figures
on motor vehicle traffic in Germany in the last pre-pandemic year, 2019.
In Sect. 3, we recall some basic facts on the statistics of rare events of an entire system
or a system of redundant subsystems. While independent redundant subsystems
largely reduce the amount of data required for testing, we also consider the case of
correlated subsystems for which the data requirements scale down less neatly. Also,
we discuss the amount of data required to actually prove sufficiently low correlation.
In Sect. 4, we test neural networks for independence or correlation for simple
classification problems. Not surprisingly, we find that such neural networks actually
provide correlated error schemes, and the system performance falls far behind the 
theoretically predicted performance for the error of statistically independent classi- 
fiers. This holds true even if we train networks to behave independently or feed the 
networks with different (toy) sensors. This demonstrates that independence cannot 
be taken for granted and it might be even hard to achieve through training methods. 
We give our conclusions and provide a brief outlook on other approaches that have 
the potential to resolve the issue of outrageous amounts of labeled data for a direct 
assurance case in the final Sect. 5. 


2 How Much Data is Needed for Direct Statistical Evidence 
of Better-Than-Human Driving? 


We focus on the loss of human life as the most important safety requirement. Our 
frame of reference is set by the traffic in Germany in the year 2019. For this year, 
the Federal Ministry of Transport and Digital Infrastructure reports 3,046 fatalities 
which results in 4.0 fatalities per billion kilometers driven on German streets in total 
and 1.1 fatalities per billion kilometers on motorways; see [Ger20, p. 165] for these 
and more detailed data. 

If we neglect that some accidents do not involve human drivers, that in deadly 
accidents oftentimes more than one person is killed, and that a large number of those 
killed did not cause the fatal accident, we obtain a lower bound of at least 250 million 
kilometers driven per fatality caused by the average human driver. Research on how
much this underestimates the actual distance is recommended but beyond the
scope of this chapter. For an upper bound, we multiply this number by an ad hoc
safety factor of 10.

Assuming an average velocity in the range of 50-100 km/h, this amounts to an
average time of about 2.5-50 million hours or 285-5,700 years of permanent driving
until the occurrence of a single fatal accident. If a camera sensor works at a frame
rate of 20-30 fps, (1.8-54) $\times 10^{11}$ frames are processed by the AI system in this time,
corresponding to 0.18-5.4 exabyte (1 exabyte = $1 \times 10^{18}$ bytes) of data, assuming
1 megabyte per high-resolution image.
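
The arithmetic behind these figures can be reproduced with a few lines. The sketch below only restates the assumptions given in the text (distance per fatality, speed, frame rate, 1 megabyte per frame); the function and variable names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def driving_effort(distance_km, speed_kmh, fps, bytes_per_frame=1e6):
    """Driving time, frame count, and raw data volume until one expected fatality."""
    hours = distance_km / speed_kmh
    frames = hours * 3600 * fps
    return {
        "hours": hours,
        "years": hours / HOURS_PER_YEAR,
        "frames": frames,
        "exabytes": frames * bytes_per_frame / 1e18,
    }


# Lower end: 250 million km at 100 km/h and 20 fps.
low = driving_effort(distance_km=250e6, speed_kmh=100, fps=20)
# Upper end: 2.5 billion km (safety factor 10) at 50 km/h and 30 fps.
high = driving_effort(distance_km=2.5e9, speed_kmh=50, fps=30)
```

The two calls span the ranges quoted above: about 2.5 to 50 million hours, roughly 285 to 5,700 years, and 0.18 to 5.4 exabyte of image data.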

Several factors can be identified that would leverage or discount the amount of 
data required for a direct test of better-than-human safety. We do not claim that our 
selection of factors is exhaustive and new ideas might change the figures in the future. 
Nevertheless, here we present some factors that certainly are of importance. 

First, due to the strong correlation of consecutive frames, the frame rate of 20-30 fps
presumably can be reduced for labeled data. Here, we take the approach that
correlation probably is high if the automated car has driven less than one meter, but
after 10 m driven there is probably not enough correlation left to infer
the safety of the automated car and its environment from the fact that it was safe 10 m
back. At the same time, this approach eliminates the effect of the average traveling
speed.

One could argue further that on many frames, no safety-relevant instance of other 
road users is given and one could potentially avoid labeling such frames. However, 
from the history of fatal accidents with the involvement of autopilots we learn that 
such accidents could even be triggered in situations considered to be non-safety- 
critical from a human perspective; see, e.g., [Nat20]. As direct measurement of safety 
should not be based on assumptions, unless they can be supported by evidence, we 
do not suggest introducing a discounting factor as we would have to assume without 
proof that we could separate hazardous from non-hazardous situations. This of course
does not exclude that a refined methodology capable of providing this separation
is developed in the future, and we refer to the extensive literature on corner cases;
see, e.g., [BBLFs19, BTLFs20, BTLFs21, HBR+21].

On the other hand, as we will present in Sect. 3, a statistically valid estimate of
the frequency of rare events requires a leverage factor of at least 3-10 applied to
the average number of frames per incident; see Sect. 3.1 for the details and precise
values.

Further, in the presence of a clear trend of the reduction of fatalities in street
traffic [Ger20] (potentially, partially due to AI-based assistance systems already), a
mere reproduction of the present-day level of safety in driving does not seem to be
sufficient. Without deeper ethical or scientific justification, we assume that humans
expect at least a 10-100 times lower failure rate of robots than they would concede
themselves, while at the same time, we recommend further ethical research and
political debate on this critical number. Even with a reduction factor of 100, the
approximately 30 fatalities due to autonomous vehicles would well exceed the risk
of being struck by lightning, which causes ~4 fatalities per year in Germany³ and often
is considered a generally acceptable risk.

In addition, several failure modes exist aside from AI-based perception that cause
fatalities in transportation. We therefore must not reserve the entire cake of acceptable
risk for perception-related errors only. Instead, here we suggest a fraction of 1/2 to 1/10
of the entire acceptable risk for perception-related root causes of fatalities. Also at
this point, we recommend more serious research, ethical consideration, and public
debate.

Drawing all this together, we obtain a total number of frames that ranges from
1.50 trillion frames in the best-case scenario to 23,000 trillion frames, or 1.5-23,000
exabyte (in the year of reference 2019, the entire Internet contained 33,000 exabyte⁴
of data). This computation is summarized in Table 1.
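
The chain of factors can be retraced with a short sketch. The perception-risk fractions 1/2 and 1/10 used below are an assumption read off the frame counts reported in Table 1, and the function name is illustrative; the remaining inputs are the quantities stated in the text.

```python
import math


def required_frames(meters_to_fatality, meters_per_frame, alpha,
                    extra_safety, perception_fraction):
    """Number of labeled frames for direct statistical evidence (cf. Table 1)."""
    frames = meters_to_fatality / meters_per_frame   # reduction factor: divides
    frames *= -math.log(alpha)                       # statistical evidence, Sect. 3.1
    frames *= extra_safety                           # robots must be safer than humans
    frames /= perception_fraction                    # only part of the risk budget
    return frames


# Best case: 10 m per frame, alpha = 5%, safety factor 10, half of the risk budget.
best_case = required_frames(2.5e11, 10, 0.05, 10, 1 / 2)
# Worst case: 1 m per frame, alpha = 0.01%, safety factor 100, a tenth of the budget.
worst_case = required_frames(2.5e12, 1, 1e-4, 100, 1 / 10)
```

Under these assumptions the two results come out at about $1.50 \times 10^{12}$ and $2.30 \times 10^{16}$ frames, matching the range stated above.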

Note that replacing fatalities with injuries reduces the amount of data by roughly 
a factor of one hundred (exact number for 2019 4/509) [Ger20]. 

Direct measurement of reliability of an AI perception system requires labeled
data, and the time to annotate a single frame by a human ranges from a couple of
minutes for bounding box labeling up to 90 min for a fully segmented image [Sch19].
Working with the span of 5-90 min per annotated frame and a span of hourly wages from
a minimum wage of 9.19 EUR for Germany in our year of reference 2019 as lower
bound to 15 EUR as upper bound, the cost of labeling a single image ranges from
0.775 to 22.5 EUR.

The total cost for labeling of test data to produce direct statistical evidence therefore
ranges from 1.16 trillion Euro in the best case to 518,000 trillion Euro in the worst case
(cf. Table 1). This compares to 3.5 trillion Euro of Germany’s gross domestic product in 2019.

We conclude in agreement with [KP16] that direct and assumption-free statistical
evidence of safety of the AI-based perception function of an automated vehicle that
complies with the safety requirements derived from the ethics committee’s final
report is largely infeasible with present-day technology.

Certainly, this does not say anything about whether an AI-based system for automated
driving actually would drive better than a human. Many experts, including
the authors, believe it could, at least in the long run. But the subjective belief of
technology-loving experts, in the absence of direct evidence, is certainly insufficient
to give credibility to the promise of enhanced safety due to automated driving
in the sense of the ethics committee’s introductory statement.

This of course does not exclude that safety arguments which are based on indirect 
forms of evidence are reasonable and possible, if they are based on assumptions 
that can be and are carefully checked. In fact, in the following, we discuss one such 
potential strategy based on redundancy and report some problems and some progress 
with this approach applied to small, academic examples. 


3 https://www.vde.com/de/blitzschutz/infos/bitzunfaelle-blitzschaeden#statistik. 
4 Here, for clarity, we use powers of 10, e.g., 1000, instead of powers of 2, e.g., 1024. 

Table 1 Factors and numbers that influence the data requirement for a statistically sound assurance
case by direct testing. Safety factors (↑) multiply and reduction factors (↓) divide the number of
frames/cost

Description                                 | ↑/↓ | Quantity      | Quantity         | Unit   | Source                         | Frames        | Frames
                                            |     | lower bound   | upper bound      |        |                                | lower bound   | upper bound
--------------------------------------------+-----+---------------+------------------+--------+--------------------------------+---------------+--------------
Meters to fatal accident (2019)             |     | 2.50 × 10^11  | 2.50 × 10^12     | m      | [Ger20] & ad hoc               |               |
Meters driven per frame                     | ↓   | 1             | 10               | m/f    | ad hoc assumption              | 2.50 × 10^10  | 2.50 × 10^12
Factor for stat. evidence                   | ↑   | 2.99 (α = 5%) | 9.21 (α = 0.01%) | factor | Sect. 3.1                      | 7.49 × 10^10  | 2.30 × 10^13
Add. safety by autom. driving               | ↑   | 10            | 100              | factor | ad hoc assumption              | 7.49 × 10^11  | 2.30 × 10^15
Fraction of perception risk from total risk | ↓   | 1/2           | 1/10             | factor | ad hoc assumption              | 1.50 × 10^12  | 2.30 × 10^16
--------------------------------------------+-----+---------------+------------------+--------+--------------------------------+---------------+--------------
                                            |     |               |                  |        |                                | Cost (EUR)    | Cost (EUR)
                                            |     |               |                  |        |                                | lower bound   | upper bound
Labeling time per frame                     | ↑   | 5             | 90               | min    | [Sch19] & ad hoc               |               |
Hourly wages                                | ↑   | 9.19          | 15               | EUR/h  | minimum wage GER 2019 & ad hoc |               |
Cost per frame                              | ↑   | 0.775         | 22.5             | EUR/f  | 2 rows above                   |               |
Total cost                                  |     |               |                  | EUR    | frames × cost per frame        | 1.16 × 10^12  | 5.18 × 10^17

3 Measurement of Failure Probabilities 


3.1 Statistical Evidence for Low Failure Probability 


In this subsection, we provide the mathematical reasoning for the leverage factor of
2.99-9.21 that accounts for statistical evidence. Let us denote by $p$ the actual probability
of a fatal accident for one single kilometer of automated driving. We are looking
for statistical evidence that $p < p_{\mathrm{tol}} = f_{\mathrm{tol}} \cdot p_{\mathrm{human}}$, where $f_{\mathrm{tol}}$ is a debit factor
accounting for the enhanced safety expected of robots and for multiple technical risks.
From Table 1, we infer that $f_{\mathrm{tol}} \in \left[\frac{1}{1000}, \frac{1}{20}\right]$, taking into account the fraction of
perception risk from total risk and the additional safety due to automated driving, cf. Sect. 2.
Here, $p_{\mathrm{human}} \approx \frac{1}{2.5 \times 10^{8}}$ is the (estimated, upper bound) probability of a fatal accident
caused by a human driver per driven kilometer. With $\hat{p} = \frac{N_{\mathrm{obs}}}{N_{\mathrm{test}}}$, we denote the estimated
probability of a fatal accident per kilometer driven for the autonomous vehicle based on the observed
number of fatal accidents $N_{\mathrm{obs}}$ on $N_{\mathrm{test}}$ kilometers of test driving. We want to test
for the alternative hypothesis $H_1$ that $p$ is below $p_{\mathrm{tol}}$ at a level of confidence $1 - \alpha$
with $\alpha \in (0, 1)$ a small number, e.g., $\alpha$ = 5%, 1%, 0.1%, 0.01% or even smaller. We
thus assume the null hypothesis $H_0$ that $p \geq p_{\mathrm{tol}}$, using that under the null hypothesis
$N_{\mathrm{obs}} \sim B(N_{\mathrm{test}}, p_{\mathrm{tol}})$ is binomially distributed with probability $p_{\mathrm{tol}}$ and $N_{\mathrm{test}}$
repetitions. The exact one-sided binomial test rejects the null hypothesis and accepts $H_1$
provided that

$$
1 - \alpha \leq P_{N \sim B(N_{\mathrm{test}}, p_{\mathrm{tol}})}(N > N_{\mathrm{obs}})
= 1 - P_{N \sim B(N_{\mathrm{test}}, p_{\mathrm{tol}})}(N \leq N_{\mathrm{obs}})
= 1 - \sum_{j=0}^{N_{\mathrm{obs}}} \binom{N_{\mathrm{test}}}{j} p_{\mathrm{tol}}^{j} (1 - p_{\mathrm{tol}})^{N_{\mathrm{test}} - j}. \tag{1}
$$


The reasoning behind (1) is the following: Assume $H_0$, i.e., the true probability
of a fatal accident due to the autonomous vehicle would be higher than $p_{\mathrm{tol}}$. Then,
with a high probability of at least $1 - \alpha$ we would have seen more fatal accidents
than just $N_{\mathrm{obs}}$, which we actually observed. This leaves us with the alternative
to either believe that in our test campaign we just observed an extremely rare event
of probability $\alpha$, or to discard the hypothesis $H_0$ that the safety requirements are not
fulfilled.

Let us suppose for the moment that the outcome of the test campaign is ideal,
i.e., no fatal accidents are observed at all, i.e., $N_{\mathrm{obs}} = 0$. In this ideal case, (1) is
equivalent to

$$
\alpha \geq (1 - p_{\mathrm{tol}})^{N_{\mathrm{test}}}
\quad\Longleftrightarrow\quad
N_{\mathrm{test}} \geq \frac{\ln(\alpha)}{\ln(1 - p_{\mathrm{tol}})} \approx \frac{-\ln(\alpha)}{p_{\mathrm{tol}}}, \tag{2}
$$

where we used the first-order Taylor series expansion of the natural logarithm at 1,
which is highly precise as $p_{\mathrm{tol}}$ is small. Thus, even in the ideal case of zero fatalities
observed, $N_{\mathrm{test}} \gtrsim \frac{-\ln(\alpha)}{p_{\mathrm{tol}}}$ is required. For $\alpha$ ranging from 5% to 0.01%, $-\ln(\alpha)$
roughly ranges from 3 (numerical value 2.9957) to 10 (numerical value 9.2103).
This explains the back-of-the-envelope estimates in Sect. 2.
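
The quality of the approximation in (2) is easy to check numerically. The following sketch is illustrative only; the function names are assumptions, and the values $p_{\mathrm{human}} = 1/(2.5 \times 10^{8})$ and $f_{\mathrm{tol}} \in \{1/20, 1/1000\}$ are the ones inferred from Sect. 2 and Table 1.

```python
import math


def exact_test_kilometers(alpha, p_tol):
    """Smallest N_test with (1 - p_tol) ** N_test <= alpha, for zero observed fatalities."""
    # log1p keeps full precision for very small p_tol.
    return math.ceil(math.log(alpha) / math.log1p(-p_tol))


def approx_test_kilometers(alpha, p_tol):
    """First-order approximation N_test ~ -ln(alpha) / p_tol from Eq. (2)."""
    return -math.log(alpha) / p_tol


P_HUMAN = 1 / 2.5e8      # fatal accidents per km, human driver (upper bound)

results = {}
for alpha, f_tol in [(0.05, 1 / 20), (1e-4, 1 / 1000)]:
    p_tol = f_tol * P_HUMAN
    results[alpha] = (
        exact_test_kilometers(alpha, p_tol),
        approx_test_kilometers(alpha, p_tol),
    )

leverage_low = -math.log(0.05)     # about 3
leverage_high = -math.log(1e-4)    # about 10
```

For probabilities this small, the exact and the approximate number of test kilometers agree to many digits, so the simple factor $-\ln(\alpha)$ is an adequate leverage factor.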

Note that the approach of [KP16] differs as it is based on a rate estimate for the
Poisson distribution. Nevertheless, as the binomial and Poisson distributions for low
probabilities approximate each other very well, this difference is negligible, as the
difference is essentially proportional to the probability of two or more fatal incidents in
one kilometer driven.


3.2 Test Data for Redundant Systems


Assuming independence of subsystems: Let $(x, y)$ be a pair of random variables,
where $x \in \mathcal{X}$ represents the input data presented to two neural networks $h_1$ and
$h_2$, and $y \in \mathcal{Y}$ denotes the corresponding ground truth label. We assume that $(x, y)$
follows a joint distribution $P$ possessing a corresponding density $p$. For each neural
network, the event of failure is described by

$$
F_i := \{h_i(x) \neq y\}, \quad i = 1, 2,
$$

with $\mathbf{1}_{F_i}$ being their corresponding indicator variables that are equal to one for an
event in $F_i$ and zero else. If and only if we assume independence of the events $F_i$,
we obtain

$$
\mathbb{E}[\mathbf{1}_{F_1} \cdot \mathbf{1}_{F_2}] = P(F_1 \cap F_2) = P(F_1) \cdot P(F_2) = \mathbb{E}[\mathbf{1}_{F_1}] \cdot \mathbb{E}[\mathbf{1}_{F_2}],
$$

which implies that the covariance fulfills

$$
\mathrm{COV}(\mathbf{1}_{F_1}, \mathbf{1}_{F_2}) = \mathbb{E}[\mathbf{1}_{F_1} \cdot \mathbf{1}_{F_2}] - \mathbb{E}[\mathbf{1}_{F_1}] \cdot \mathbb{E}[\mathbf{1}_{F_2}] = 0.
$$

This is easily extended to $n$ neural networks $h_i(x)$, $i \in \mathcal{I} = \{1, \ldots, n\}$ and their
corresponding failure sets $F_i$. Under the hypothesis of independence of the family
of events $F_i$, we obtain

$$
p_{\mathrm{system}} = \mathbb{E}\Big[\prod_{i \in \mathcal{I}} \mathbf{1}_{F_i}\Big] = \prod_{i \in \mathcal{I}} P(F_i) = \prod_{i \in \mathcal{I}} p_{\mathrm{sub},i},
$$

where $p_{\mathrm{system}}$ is the probability of failure of a system of $\#\mathcal{I} = n$ redundant neural
networks working in parallel, where failure is defined as all networks being wrong
at the same time [ME14] and $p_{\mathrm{sub},i} = P(F_i)$ is the probability of failure for the $i$th
subsystem $h_i(x)$.

Let us suppose for convenience that the failure probabilities of the subsystems $h_i(x)$ are all
equal, $p_{\mathrm{sub},i} = p_{\mathrm{sub}}$. Then $p_{\mathrm{system}} = p_{\mathrm{sub}}^{n}$. In order to give evidence that $p_{\mathrm{system}} < p_{\mathrm{tol}}$,
it is thus enough to provide evidence for $p_{\mathrm{sub}} = p_{\mathrm{sub},i} < p_{\mathrm{tol}}^{1/n}$ for $i \in \mathcal{I}$. Subsystem
testing to a confidence of $(1 - \alpha)$ on the system level requires a higher confidence
at the subsystem level, which, by a simple Bonferroni correction [HS16], can be
conservatively estimated as $(1 - \frac{\alpha}{n})$. Consequently, by (2) the amount of data for
testing the subsystem $h_i(x)$ is given by

$$
N_{\mathrm{test},i} \geq \frac{-\ln\left(\frac{\alpha}{n}\right)}{p_{\mathrm{tol}}^{1/n}}. \tag{3}
$$

As $p_{\mathrm{tol}}^{1/n}$ is much larger than $p_{\mathrm{tol}}$, the amount of testing data required is radically
reduced, even if one employs $n$ separate test sets for all $n$ subsystems. By comparison
of (2) and (3), the factor of reduction is roughly

$$
\gamma_n = \frac{N_{\mathrm{test}}}{n N_{\mathrm{test},i}} = \frac{1}{n\, p_{\mathrm{tol}}^{1 - \frac{1}{n}} \left(1 - \frac{\ln(n)}{\ln(\alpha)}\right)}.
$$


For example, for $n = 2$ in the best-case scenario, $\alpha$ = 5% and $f_{\mathrm{tol}} = \frac{1}{20}$, the reduction
factor is $\gamma_2$ = 28,712, which reduces the corresponding $1.5 \times 10^{12}$ frames to
52.2 million frames. This already is no longer an absolutely infeasible number.
For the worst-case scenario, $\alpha$ = 0.01% and $f_{\mathrm{tol}} = \frac{1}{1000}$, the reduction factor is
$\gamma_2$ = 232,502, resulting in 98.9 billion frames, which seems out of reach, but not to
the extent of $2.3 \times 10^{16}$ frames.
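
The two-subsystem figures can be recomputed from the reduction factor. The sketch below assumes $p_{\mathrm{tol}} = f_{\mathrm{tol}} \cdot p_{\mathrm{human}}$ with $p_{\mathrm{human}} = 1/(2.5 \times 10^{8})$ per kilometer, as inferred from Sect. 2; the function name is illustrative.

```python
import math


def reduction_factor(n, alpha, p_tol):
    """gamma_n = N_test / (n * N_test_i) for n independent redundant subsystems."""
    bonferroni = 1.0 - math.log(n) / math.log(alpha)   # equals ln(alpha / n) / ln(alpha)
    return 1.0 / (n * p_tol ** (1.0 - 1.0 / n) * bonferroni)


P_HUMAN = 1 / 2.5e8      # fatal accidents per km, human driver (upper bound)

gamma_best = reduction_factor(2, 0.05, P_HUMAN / 20)      # best-case scenario
gamma_worst = reduction_factor(2, 1e-4, P_HUMAN / 1000)   # worst-case scenario

frames_best = 1.5e12 / gamma_best
frames_worst = 2.3e16 / gamma_worst
```

With these inputs, `gamma_best` and `gamma_worst` come out at roughly 28,712 and 232,502, and the frame counts at about 52 million and 99 billion. The function accepts any `n`, but only the case of two subsystems is evaluated here.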

Keeping the other values fixed, $n = 3$ even yields a reduction factor of $\gamma_3$ =
974,672 in the best-case scenario and $\gamma_3$ = 11,818,614 in the worst-case scenario,
resulting in 1.54 million frames in the best and 1.95 billion frames in the worst-case
scenario. These numbers look almost realizable, given the economic interests
at stake.

However, the strategy based on redundant subsystems comes with a catch. It is only
applicable if the subsystems are independent. But this is an assumption that is not
necessarily true. We therefore investigate what happens if the errors of subsystems
are not independent.


Assuming no independence of subsystems: In this case, the covariance of the error
indicator variables $\mathbf{1}_{F_i}$ is not equal to zero and can be regarded as a measure of
the joint variability for the random variables $\mathbf{1}_{F_i}$. The normalized version of the
covariance is the Pearson correlation

$$
\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2}) = \frac{\mathrm{COV}(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})}{\sigma(\mathbf{1}_{F_1})\, \sigma(\mathbf{1}_{F_2})} \in [-1, 1],
$$

where $\sigma(\mathbf{1}_{F_i}) = \sqrt{p_{\mathrm{sub},i}(1 - p_{\mathrm{sub},i})}$ denotes the standard deviation of $\mathbf{1}_{F_i}$, which is
supposed to be greater than zero for $i = 1, 2$. The correlation measures the linear
relationship between the random variables $\mathbf{1}_{F_i}$ and takes values $\pm 1$ if the relationship
between the random variables is deterministic.

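Let us first consider a system with two redundant subsystems in parallel, $h_i(x)$,
$i = 1, 2$, where we however drop the assumption of independence. Then we obtain

$$
\begin{aligned}
p_{\mathrm{system}} &= \mathbb{E}[\mathbf{1}_{F_1} \cdot \mathbf{1}_{F_2}] \\
&= \mathrm{COV}(\mathbf{1}_{F_1}, \mathbf{1}_{F_2}) + \mathbb{E}[\mathbf{1}_{F_1}]\, \mathbb{E}[\mathbf{1}_{F_2}] \\
&= \rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2}) \sqrt{p_{\mathrm{sub},1}(1 - p_{\mathrm{sub},1})\, p_{\mathrm{sub},2}(1 - p_{\mathrm{sub},2})} + p_{\mathrm{sub},1}\, p_{\mathrm{sub},2}.
\end{aligned} \tag{4}
$$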

Assuming again equal failure probabilities for the subsystems, $p_{\mathrm{sub}} = p_{\mathrm{sub},1} = p_{\mathrm{sub},2}$,
and using $1 - p_{\mathrm{sub}} \approx 1$ as a good approximation as $p_{\mathrm{sub}}$ for a safe system is very
small, we obtain from (4)

$$
p_{\mathrm{system}} \approx \rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})\, p_{\mathrm{sub}} + p_{\mathrm{sub}}^{2}, \tag{5}
$$

i.e., we can only expect a reduction of the frames needed for testing which is comparable
to the case where statistical independence holds if $\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})$ is of the same
small order of magnitude as $p_{\mathrm{sub}}$. If, e.g., we assume an extremely weak correlation
of $\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2}) = 0.01$, we can essentially neglect the $p_{\mathrm{sub}}^{2}$-term as $p_{\mathrm{sub}} \ll 0.01$ and
realize that the reduction factor essentially is $\frac{1}{\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})} = 100$, only. Thus, even for
such a pretty uncorrelated error scheme, the number of frames required for testing
would be lower bounded by $1.5 \times 10^{10}$ to $2.3 \times 10^{14}$ frames, even neglecting
Bonferroni correction and independent test sets which make up a multiplication factor
$\beta_n = n\left(1 - \frac{\ln(n)}{\ln(\alpha)}\right)$, yielding $\beta_2 = 2.46$ and $\beta_2 = 2.15$, respectively. With these
effects taken into account, we arrive at 36.9 billion frames in the best-case scenario
and $4.94 \times 10^{14}$ frames in the worst case, where even the lower number of frames
seems hardly feasible.
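
A short numerical sketch of the correlated two-subsystem case follows. The subsystem failure probability `p_sub` is a hypothetical value chosen only to show the effect; the function names are illustrative, and the frame counts are the ones stated in the text.

```python
import math


def system_failure_probability(p_sub, rho):
    """Eq. (4) for two subsystems with equal failure probability p_sub."""
    return rho * p_sub * (1.0 - p_sub) + p_sub ** 2


def bonferroni_factor(n, alpha):
    """beta_n = n * (1 - ln(n) / ln(alpha)): separate test sets plus Bonferroni correction."""
    return n * (1.0 - math.log(n) / math.log(alpha))


p_sub = 1e-5     # hypothetical subsystem failure probability
rho = 0.01       # assumed, very weak error correlation

p_system = system_failure_probability(p_sub, rho)
effective_reduction = p_sub / p_system        # close to 1 / rho, not 1 / p_sub

beta_best = bonferroni_factor(2, 0.05)
beta_worst = bonferroni_factor(2, 1e-4)

frames_best = 1.5e12 / 100 * beta_best
frames_worst = 2.3e16 / 100 * beta_worst
```

Even though the second subsystem fails only once in 100,000 cases on its own, the correlated pair is merely about 100 times more reliable than a single subsystem, which is what caps the saving in test data.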

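A related computation for $n = 3$ yields to leading order, using (5) and approximating
the complement of small probabilities with one and neglecting terms of order
$p_{\mathrm{sub}}^{2}$,

$$
\begin{aligned}
p_{\mathrm{system}} &= \mathbb{E}[\mathbf{1}_{F_1} \cdot \mathbf{1}_{F_2} \cdot \mathbf{1}_{F_3}] \\
&\approx \rho(\mathbf{1}_{F_1 \cap F_2}, \mathbf{1}_{F_3}) \sqrt{\mathbb{E}[\mathbf{1}_{F_1} \cdot \mathbf{1}_{F_2}]\, p_{\mathrm{sub}}} + \mathbb{E}[\mathbf{1}_{F_1} \cdot \mathbf{1}_{F_2}]\, p_{\mathrm{sub}} \\
&\approx \rho(\mathbf{1}_{F_1 \cap F_2}, \mathbf{1}_{F_3}) \sqrt{\left(\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})\, p_{\mathrm{sub}} + p_{\mathrm{sub}}^{2}\right) p_{\mathrm{sub}}}
 + \left(\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})\, p_{\mathrm{sub}} + p_{\mathrm{sub}}^{2}\right) p_{\mathrm{sub}} \\
&\approx \rho(\mathbf{1}_{F_1 \cap F_2}, \mathbf{1}_{F_3}) \sqrt{\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})}\; p_{\mathrm{sub}}.
\end{aligned} \tag{6}
$$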

If we thus assume that the correlation $\rho(\mathbf{1}_{F_1 \cap F_2}, \mathbf{1}_{F_3})$ between the failure of subsystem
$h_3(x)$ and the composite redundant subsystem from $h_1(x)$ and $h_2(x)$, and the correlation
$\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})$, are both equal to 0.01, we obtain a total reduction factor of roughly

$$
\frac{1}{\rho(\mathbf{1}_{F_1 \cap F_2}, \mathbf{1}_{F_3}) \sqrt{\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})}} \approx 1{,}000,
$$

which still leads to roughly $1.50 \times 10^{9}$ to $2.30 \times 10^{13}$ frames, even without
Bonferroni correction and independent test sets for subsystems. With both taken into
account, the amount of data ranges from 5.04 billion frames in the best-case scenario
to $9.43 \times 10^{13}$ frames in the worst case, where only the figure obtained in the best
case, based on problematic choices, seems remotely feasible. However, in the presence
of domain shifts in time and location, it seems questionable if the road of testing
weakly correlated subsystems is viable (supposing they are weakly correlated).

We also note that correlation coefficients as low as $\rho = 0.01$ are rarely found
in nature and in addition, it requires empirical testing to provide evidence for a low
correlation. The correlations we measure in Sect. 4 for the case of simple classification
problems miss this low level by at least an order of magnitude, leading to an extra
factor of at least 10 in the above considerations.


3.3 Data Requirements for Statistical Evidence of Low 
Correlation 


In the preceding section, we have seen that low correlation between subsystems
efficiently reduces the data required for testing better-than-human safety of
autonomous vehicles. However, to achieve this, e.g., in the case of two redundant
subsystems, the correlation has to be in the order of magnitude of $p_{\mathrm{sub}} = p_{\mathrm{tol}}^{1/2}$. Statistical
evidence for such a low level of correlation requires data itself. Let us suppose
the ideal situation once more that the correlation coefficient is strictly zero, $\rho = 0$, and
we would like to compute the number of pairs of observations of the random variables
$(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})$ that is needed to prove that $\rho$ is in the order of magnitude of $p_{\mathrm{sub}}$, as
required for a decent downscaling of the number of test data frames. In other words,
we have to estimate the number of samples needed to provide statistical evidence at a
given significance level $\alpha$ that $\rho(\mathbf{1}_{F_1}, \mathbf{1}_{F_2})$ is less than $p_{\mathrm{sub}} = p_{\mathrm{tol}}^{1/2}$.
As shown by Ronald Fisher and others, see, e.g., [LL88], the quantity
$Z = \frac{1}{2} \log\left(\frac{1 + \hat{\rho}}{1 - \hat{\rho}}\right)$ is asymptotically normally distributed with expected value
$\frac{1}{2} \log\left(\frac{1 + \rho}{1 - \rho}\right) = 0$ in our case, where we assumed $\rho = 0$. The standard deviation is
given by $\frac{1}{\sqrt{N_{\mathrm{test}} - 3}}$. Here, $\hat{\rho}$ stands for the empirical correlation coefficient of the
pair of observations [HS16].

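A two-sided confidence interval for a given level of confidence $1 - \alpha$ for the
observed value $\hat{z}$ of $Z$ thus is given by

$$
\hat{z}_{\pm} = \hat{z} \pm z_{1 - \frac{\alpha}{2}} \sqrt{\frac{1}{N_{\mathrm{test}} - 3}}, \tag{7}
$$

where $z_{1 - \frac{\alpha}{2}}$ is the $1 - \frac{\alpha}{2}$ quantile of the standard normal distribution. Transforming
back (7), we obtain lower and upper bounds

$$
\rho_{\pm} = \frac{\exp(2\hat{z}_{\pm}) - 1}{\exp(2\hat{z}_{\pm}) + 1}. \tag{8}
$$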


Under our best-case hypothesis p = 0, the boundaries 2+ of the confidence interval 
d exp(2z)—1 — 
dz exp(2z)+1 z=0 m 


converge to zero; we may apply the d-rule with the derivative 


4exp(2z) 
(exp2z)+1)? | -o 


the best possible outcome obtained for z = 0. By (8) and the 6-rule, asymptotically 


= 1. Let us consider the width W of the confidence interval for p for 


for large Nest it is given by W = 2z)_4 |y Even if this is the case, to infer 
2 test 
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that |p| < p,,, with confidence 1 — a, one requires, for the case of two subsystems, 
1 
Z-4 w < Doub = Por: If Mest is large, we can neglect the —3 term and obtain 


for the best case 


Mest © Z, (9) 


Not unexpectedly, this brings back the bad scaling behavior observed in (2) and the 
related problematic data requirements, which are essentially the same as for the non- 
redundant, direct approach. The numbers for zia fora = 5% ...œa = 0.01% range 
from 2.706 to 13.831 which essentially confirms the range of roughly 3 ...10 for the 
statistical safety factor obtained from (1) and (2). 
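
The sample-size argument of this section can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the standard normal quantiles are taken from Python's `statistics.NormalDist`, and the system failure probability `1e-8` in the usage lines is a purely illustrative assumption. It evaluates the Fisher-transform confidence interval of (7) and (8) and the best-case bound (9).

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_interval(rho_hat: float, n_test: int, alpha: float) -> tuple[float, float]:
    """Two-sided (1 - alpha) confidence interval for a correlation coefficient."""
    # atanh(r) equals 0.5 * log((1 + r) / (1 - r)), i.e., the Fisher transform Z.
    z_hat = atanh(rho_hat)
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) / sqrt(n_test - 3)
    # tanh(z) equals (exp(2z) - 1) / (exp(2z) + 1), i.e., the back-transformation.
    return tanh(z_hat - half_width), tanh(z_hat + half_width)


def required_samples(p_sys: float, alpha: float) -> float:
    """Best-case sample size z_{1-alpha/2}^2 / p_sys, neglecting the -3 term."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z / p_sys


def squared_quantile(alpha: float) -> float:
    """Squared (1 - alpha) quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - alpha) ** 2


if __name__ == "__main__":
    # Squared quantiles for alpha = 5% and alpha = 0.01%.
    print(round(squared_quantile(0.05), 3), round(squared_quantile(0.0001), 3))
    # Interval for an observed correlation of exactly zero.
    print(fisher_interval(0.0, 10_003, 0.05))
    # p_sys = 1e-8 is an illustrative assumption, not a value from the text.
    print(f"{required_samples(1e-8, 0.05):.3e}")
```

Because the interval half-width shrinks only like $1/\sqrt{N_{\mathrm{test}}}$, pushing it below $p_{\mathrm{sub}}$ requires a number of samples proportional to $1/p_{\mathrm{sub}}^2$, which is what `required_samples` makes explicit.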


4 Correlation Between Errors of Neural Networks in 
Computer Vision 


As of now, deep neural networks (DNNs) for perception tasks are far away from being 
perfect. Motivated by common practices in reliability engineering, redundancy, i.e., 
the deployment of multiple system components pursuing the same task in parallel, 
might be one possible approach toward improving the reliability of a perception 
system. 

Redundancy can enter into perception systems in many ways. Assume a system
setup with multiple sensors, e.g., camera, LiDAR, and RaDAR. There are multiple
options to design a deep-learning-driven perception system processing the different
sensors’ data. A non-exhaustive list of designs may look as follows:

1. Only a single sensor is processed; this is done by a single DNN;
2. Only a single sensor is processed; this is done by a committee of DNNs;
3. All sensors are processed by a single sensor-fusing DNN;
4. All sensors are processed by a committee of sensor-fusing DNNs;
5. Each sensor is processed by a separate DNN; the results are fused afterwards;
6. Each sensor is processed by a committee of DNNs; the results are fused afterwards.


Except for the first design, all other designs incorporate redundancy. Herein, there 
are two types of redundancy, redundancy via multiple sensors (all pursuing the same 
task of perceiving the environment) and redundancy via multiple DNNs. 

Certainly, approach one is only eligible for direct testing, see Sect.3.1, and the 
same is true for the “early fusion” approach 3. All the other approaches could poten- 
tially benefit from redundancy, if independence or low correlation of the errors can be 
assumed. Therefore, the degree of independence, which can be understood and quan- 
tified as the degree of uncorrelatedness, is a quantity of interest for safety and also for 
testing; see Sect. 3.2. However, as explained in Sect. 3.2, in order to use redundancy 
as a part of the solution to the testing problem outlined in Sect. 2, correlation has to 
be extremely low.

For the case of simple DNNs processing the same sensor’s data, we give evidence
that such low correlation does not hold in general. The evidence we find rather points
in the opposite direction: it is hard to obtain a correlation below 0.5, even
if the training datasets and network architectures are kept well separated. On the other
hand, one could try to train DNNs such that their failure events are uncorrelated.

In this section, we show preliminary numerical results on MNIST (handwritten
digits), EMNIST (also containing handwritten letters), and 3D-MNIST for

• training DNNs for independence/less correlated failure events;
• the role of different sensors (by viewing 3D-MNIST examples from different
  directions).


Although our findings do not directly apply to physically more diverse sensors 
like Camera, LiDAR, and RaDAR, these preliminary results indicate that the inde- 
pendence of DNN-based perception systems cannot be taken simply for granted, 
even if different sources of data are employed. 


4.1 Estimation of Reliability for Dependent Subsystems 


Most commonly, the so-called active parallel connection of $n$ subsystems is used,
wherein the entire system (meaning the ensemble of DNNs) is assumed to be functional
iff at least one subsystem (which corresponds to one committee member $h_i$) is
functional. However, the active parallel connection is not the only decision rule of
interest that can be applied to the committee $h_i$, $i = 1, \ldots, n$. For instance,
consider a pedestrian detection performed by a committee $h_i$ that detects a pedestrian
if at least one of the DNNs does so. For increasing $n$, we would expect an increase
in false positive detections, therefore facing the typical trade-off between false
positives and false negatives. In order to steer this trade-off, we use k-out-of-n
systems that are functional iff at least $k$ out of $n$ components $h_i$ are functional,
i.e., at most $n - k$ components fail. Hence, we are interested in the event

$$\left\{ \sum_{i=1}^{n} 1_{F_i} \le n - k \right\}$$

and its probability, which is the probability of the k-out-of-n system being functional.
The reliability of a k-out-of-n system can be expressed analytically in terms of the
reliability of its components; see also [Zac92]. For $k = 1$, this boils down to the
active parallel connection.

If we assume independence of the failure events $F_i$ and that all networks fail with
equal probability $P(F_i) = p_{\mathrm{sub},i} = p_{\mathrm{sub}}$, then the probability
that at least k-out-of-n networks are functional can be calculated via

$$P\left( \sum_{i=1}^{n} 1_{F_i} \le n - k \right) = \sum_{j=k}^{n} \binom{n}{j} (1 - p_{\mathrm{sub}})^{j}\, p_{\mathrm{sub}}^{\,n-j} = 1 - F_B(k - 1; n, 1 - p_{\mathrm{sub}}), \qquad (10)$$

where $F_B$ denotes the distribution function of the binomial distribution. This quantity
serves as a reference in our experiments.
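
The reference value in (10) is straightforward to evaluate. The sketch below is an illustration under the independence assumption stated above; the function names are hypothetical and not taken from the chapter. It computes the sum over $j$ directly and checks it against the binomial distribution function.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, p_sub: float) -> float:
    """Probability that at least k of n independent subsystems are functional,
    where every subsystem fails with probability p_sub."""
    return sum(
        comb(n, j) * (1.0 - p_sub) ** j * p_sub ** (n - j)
        for j in range(k, n + 1)
    )


def binomial_cdf(x: int, n: int, q: float) -> float:
    """Distribution function F_B(x; n, q) of the binomial distribution."""
    return sum(comb(n, j) * q**j * (1.0 - q) ** (n - j) for j in range(x + 1))


if __name__ == "__main__":
    n, p_sub = 5, 0.1  # illustrative values only
    for k in range(1, n + 1):
        direct = k_out_of_n_reliability(k, n, p_sub)
        via_cdf = 1.0 - binomial_cdf(k - 1, n, 1.0 - p_sub)
        print(k, round(direct, 6), round(via_cdf, 6))
```

For $k = 1$ the result equals $1 - p_{\mathrm{sub}}^{n}$ (active parallel connection), and for $k = n$ it equals $(1 - p_{\mathrm{sub}})^{n}$.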


4.2 Numerical Experiments 


In this section, we conduct the first experiments using the datasets EMNIST,
CIFAR10, and 3D-MNIST. The original dataset MNIST [LCB98] contains 60,000
grayscale images of size 28 × 28 displaying handwritten digits (10 classes). EMNIST
is an extension of MNIST that contains handwritten digits and characters of the same
28 × 28 resolution. Of that dataset, we only considered the characters (26 classes), of
which there are 145,600 available. We used 60,000 images for training, 20,000 for
validation, and the rest for testing. CIFAR10 contains 60,000 RGB images of size
32 × 32 categorized into 10 classes (from the categories animals and machines). We
used the default split of 50,000 training and 10,000 test examples. We reserved a
quarter of the training set for validation. 3D-MNIST contains point clouds living on
a $16^3$-lattice. The dataset contains 10,000 training and 2,000 test examples.

We used convolutional DNNs implemented in Keras [C+15] with simple architectures;
if not stated otherwise, they contain 2 convolutional layers with 32 and 64
3 × 3 filters, respectively, each of them followed by a leakyReLU activation and 2 × 2
max pooling, and finally a single dense layer. For training, we used a batch size of
256, weight decay of $10^{-4}$, and the Adam optimizer [KB15] with default parameters,
except for the learning rate. We started with a learning rate of $10^{-2}$, trained until
stagnation, and repeated that with a learning rate of $10^{-3}$.


Reducing correlations with varying training data, architecture, and weight
initializers: First, we study to what extent independence can be promoted by varying
the training data, architecture, and weight initializers in an active parallel system
with $n = 2$ DNNs. To this end, we split the training and validation data into
two disjoint chunks, consider a second architecture in which we add a third
convolutional layer with 128 filters, again followed by leakyReLU and max pooling, and
use the Glorot uniform and Glorot normal initializers with default parameters. For the
sake of brevity, we introduce a short three-bit notation indicating the boolean truth
values corresponding to the questions


(same data? same architecture? same initializer?). (11) 


For instance, 101 stands for two DNNs being trained with the same data, having
different architectures, and using the same initializer. We report results in terms
of average accuracy, estimating $\frac{1}{2}\sum_{i}(1 - P(F_i))$, and joint accuracy,
estimating $1 - P(\bigcap_{i} F_i)$.

Table 2 Correlation coefficients and average accuracies for EMNIST and CIFAR10. The
configurations read as defined in (11). All runs have been performed 10 times; all
numbers are averaged over these 10 runs, and the standard deviations over these 10 runs
are given

Configuration | EMNIST $\rho(1_{F_1}, 1_{F_2})$ | EMNIST avg. acc. (%) | CIFAR10 $\rho(1_{F_1}, 1_{F_2})$ | CIFAR10 avg. acc. (%)
111 | 0.71±0.01 | 91.17±0.05 | 0.73±0.01 | 72.21±0.45
110 | 0.71±0.00 | 91.17±0.06 | 0.74±0.01 | 72.19±0.22
101 | 0.71±0.01 | 91.13±0.04 | 0.74±0.01 | 72.08±0.43
100 | 0.71±0.01 | 91.13±0.05 | 0.74±0.01 | 72.14±0.36
011 | 0.58±0.01 | 89.65±0.14 | 0.66±0.02 | 66.10±0.74
010 | 0.58±0.01 | 89.76±0.08 | 0.65±0.01 | 66.08±0.35
001 | 0.57±0.01 | 89.74±0.07 | 0.66±0.01 | 66.29±0.65
000 | 0.58±0.01 | 89.62±0.13 | 0.65±0.01 | 66.30±0.38


Table 2 shows correlation coefficients much greater than zero in all cases.
Corresponding $\chi^2$ tests with significance level $\alpha = 0.05$ rejected the
hypothesis that the DNNs are independent in all cases. Notably, varying the initializer
or the network architecture barely changes the results, while changing the training data
seems to have the biggest impact, clearly reducing the correlation coefficient. A
reduction in correlation also reduces the average performance of the networks. For the
sake of comparability, all networks were trained with only half of the training data,
since otherwise the configurations with equal data would have the advantage of working
with twice the amount of training data compared to the configurations with different
training data.
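
The two statistics used here can be sketched as follows. This is an illustration, not the authors' evaluation code: the function names are hypothetical, the failure indicators are assumed to be given as 0/1 lists, and the test shown is the plain Pearson $\chi^2$ test of a 2 × 2 contingency table without continuity correction, which may differ in detail from the test used for Table 2.

```python
from math import erfc, sqrt


def failure_correlation(fail_1: list[int], fail_2: list[int]) -> float:
    """Pearson correlation of two 0/1 failure indicator sequences."""
    n = len(fail_1)
    n11 = sum(a * b for a, b in zip(fail_1, fail_2))  # both networks fail
    n1_, n_1 = sum(fail_1), sum(fail_2)                # marginal failure counts
    n10, n01 = n1_ - n11, n_1 - n11
    n00 = n - n11 - n10 - n01
    denom = sqrt(n1_ * (n - n1_) * n_1 * (n - n_1))
    if denom == 0.0:
        raise ValueError("correlation undefined: an indicator is constant")
    return (n11 * n00 - n10 * n01) / denom


def chi2_independence_pvalue(fail_1: list[int], fail_2: list[int]) -> float:
    """p-value of the Pearson chi-square independence test for a 2x2 table."""
    # For a 2x2 table, the Pearson statistic equals N times the squared correlation.
    chi2 = len(fail_1) * failure_correlation(fail_1, fail_2) ** 2
    # Survival function of the chi-square distribution with one degree of freedom.
    return erfc(sqrt(chi2 / 2.0))


if __name__ == "__main__":
    f1 = [1, 1, 0, 0, 0, 0, 0, 0]  # toy data, not from the experiments
    f2 = [1, 0, 0, 0, 0, 0, 0, 0]
    print(failure_correlation(f1, f2), chi2_independence_pvalue(f1, f2))
```

Independence is rejected at level $\alpha$ whenever the returned p-value is below $\alpha$.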

Next, we study in this setting whether we can achieve at least conditional
independence. To this end, we condition on the difficulty of the task by conditioning
on softmax entropy quantiles. More precisely, we compute the entropy of the softmax
distribution of both networks for all data points. We then sum the entropy values for
each data point over the two networks and group all examples into 8 equally sized
bins according to ascending summed entropy.
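
The binning step can be sketched as below. This is a minimal illustration with hypothetical function names; it assumes that the softmax outputs of both networks are available as lists of probability vectors and that any remainder of an uneven split is assigned to the last bin, a detail the text does not specify.

```python
from math import log


def softmax_entropy(probs: list[float]) -> float:
    """Shannon entropy of a softmax distribution (natural logarithm)."""
    return -sum(p * log(p) for p in probs if p > 0.0)


def entropy_bins(
    probs_1: list[list[float]], probs_2: list[list[float]], n_bins: int = 8
) -> list[list[int]]:
    """Group sample indices into n_bins bins of ascending summed entropy."""
    summed = [softmax_entropy(p) + softmax_entropy(q) for p, q in zip(probs_1, probs_2)]
    order = sorted(range(len(summed)), key=summed.__getitem__)
    size = len(order) // n_bins
    bins = [order[b * size:(b + 1) * size] for b in range(n_bins - 1)]
    bins.append(order[(n_bins - 1) * size:])  # remainder goes to the last bin
    return bins


if __name__ == "__main__":
    # Toy two-class outputs whose entropy grows with the sample index.
    outputs = [[1.0 - 0.5 * (i + 1) / 16, 0.5 * (i + 1) / 16] for i in range(16)]
    print(entropy_bins(outputs, outputs))
```

The correlation coefficient of the failure indicators is then computed separately on the indices of each bin.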

Table 3 shows that the higher the softmax entropy gets, the less the DNNs’ failures
are correlated. The correlation coefficients go down to 0.25 for EMNIST and
0.23 for CIFAR10 in softmax entropy bin no. 8, which contains the highest
entropy values. Still, $\chi^2$ tests reveal that the correlations are too strong to
assume independent failures.


Training two networks for enhanced independence: Since the measures considered
so far do not lead to success, we explicitly try to decorrelate the failures of the
DNNs. To this end, we incorporate an additional loss function into training, added to
the typically used empirical cross-entropy loss. Let $p(y|x, h_i)$ denote the probability
estimated by the DNN $h_i$ that the class $y$ is the correct one given the input $x$. One
possible approach is to explicitly enforce a prediction of $h_1$ different from that of $h_2$



Table 3 Correlation coefficients $\rho(1_{F_1}, 1_{F_2})$ for different quantiles of
softmax entropy computed on EMNIST and CIFAR10. The configurations read as defined in
(11). The experiments have been repeated 10 times; the corresponding standard errors are
of the order of 0.01

Config. | EMNIST bin 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | CIFAR10 bin 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
111 | 1.0 | 1.0 | 1.0 | 1.0 | 0.99 | 0.94 | 0.71 | 0.45 | 1.0 | 0.99 | 0.94 | 0.79 | 0.64 | 0.57 | 0.45 | 0.26
110 | 1.0 | 1.0 | 1.0 | 1.0 | 0.99 | 0.95 | 0.71 | 0.45 | 1.0 | 1.0 | 0.96 | 0.79 | 0.67 | 0.55 | 0.44 | 0.31
101 | 1.0 | 1.0 | 1.0 | 1.0 | 0.99 | 0.94 | 0.71 | 0.43 | 1.0 | 0.99 | 0.94 | 0.78 | 0.67 | 0.55 | 0.43 | 0.36
100 | 1.0 | 1.0 | 1.0 | 0.99 | 1.0 | 0.95 | 0.71 | 0.45 | 1.0 | 0.99 | 0.95 | 0.79 | 0.64 | 0.51 | 0.46 | 0.36
011 | 1.0 | 1.0 | 0.99 | 0.98 | 0.90 | 0.71 | 0.40 | 0.28 | 0.99 | 0.94 | 0.81 | 0.65 | 0.6 | 0.44 | 0.37 | 0.27
010 | 1.0 | 1.0 | 1.0 | 0.98 | 0.89 | 0.73 | 0.43 | 0.26 | 1.0 | 0.95 | 0.82 | 0.64 | 0.5 | 0.44 | 0.40 | 0.25
001 | 1.0 | 1.0 | 0.98 | 0.96 | 0.90 | 0.73 | 0.40 | 0.25 | 0.99 | 0.94 | 0.83 | 0.63 | 0.54 | 0.43 | 0.37 | 0.27
000 | 1.0 | 1.0 | 0.99 | 0.98 | 0.90 | 0.72 | 0.42 | 0.27 | 0.99 | 0.95 | 0.80 | 0.65 | 0.53 | 0.45 | 0.35 | 0.23


if the latter fails and vice versa. This can be achieved by minimizing the following
quantity:

$$-\mathbb{E}_{(x,y)\sim P}\left[ 1_{\{h_i(x) \neq y\}} \log\left(1 - p(h_i(x)|x, h_j)\right) + 1_{\{h_j(x) \neq y\}} \log\left(1 - p(h_j(x)|x, h_i)\right) \right]$$

with its empirical counterpart

$$J_{i,j}\left(\{(x_m, y_m)\}_{m=1,\ldots,M}\right) = -\frac{1}{M} \sum_{m=1}^{M} \left[ 1_{\{h_i(x_m) \neq y_m\}} \log\left(1 - p(h_i(x_m)|x_m, h_j)\right) + 1_{\{h_j(x_m) \neq y_m\}} \log\left(1 - p(h_j(x_m)|x_m, h_i)\right) \right], \qquad (12)$$

where $\{(x_m, y_m)\}_{m=1,\ldots,M}$ denotes a sample of $M$ data points $(x_m, y_m)$. In
our experiments, we use a penalization coefficient/loss weight $\lambda$ and then add
$\lambda \cdot J_{1,2}(\{(x_m, y_m)\}_{m=1,\ldots,M})$ to the empirical cross-entropy loss.
In further experiments not presented here, we also used other loss functions that
explicitly enforce anti-correlated outputs or even independence of the softmax
distributions. However, these loss functions led to uncontrollable behavior of the
training pipeline, in particular when trying to tune the loss weight $\lambda$. Therefore,
we present results for the loss function in (12).
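
To make the penalty concrete, the following sketch evaluates it on given softmax outputs. It is an illustration only: the function names are hypothetical, averaging over the $M$ samples and the small clipping constant `eps` are assumptions made here for numerical convenience, and a real training setup would express the same term with differentiable tensor operations of the deep learning framework.

```python
from math import log


def argmax(values: list[float]) -> int:
    """Index of the largest entry (first one in case of ties)."""
    return max(range(len(values)), key=values.__getitem__)


def decorrelation_penalty(
    probs_i: list[list[float]],
    probs_j: list[list[float]],
    labels: list[int],
    eps: float = 1e-12,
) -> float:
    """Empirical penalty: whenever one network predicts a wrong class, the other
    network is penalized for assigning probability to that same wrong class."""
    total = 0.0
    for p_i, p_j, y in zip(probs_i, probs_j, labels):
        pred_i, pred_j = argmax(p_i), argmax(p_j)
        if pred_i != y:  # h_i fails: look at h_j's probability of h_i's prediction
            total -= log(max(1.0 - p_j[pred_i], eps))
        if pred_j != y:  # h_j fails: look at h_i's probability of h_j's prediction
            total -= log(max(1.0 - p_i[pred_j], eps))
    return total / len(labels)


if __name__ == "__main__":
    # Toy example: h_i predicts class 1 although the label is 0.
    print(decorrelation_penalty([[0.2, 0.8]], [[0.5, 0.5]], [0]))
```

The penalty is zero when both networks are correct and grows without bound as the second network approaches full confidence in the first network's wrong class.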

Figure 1 (top) depicts the correlation as well as the average accuracy for different
values of $\lambda$, ranging from 0 to extreme values of 200. For increasing values of
$\lambda$, we observe a clear drop in performance from an initial average accuracy of more
than 91% for $\lambda = 0$ down to roughly 72% for $\lambda = 200$. At the same time, the
correlation decreases to values below 0.3, which, however, is still not enough to assume
independence, as confirmed by our $\chi^2$ tests. While the average accuracy decreases
monotonically, the joint accuracy peaks around $\lambda = 7.5$; see Fig. 1
(bottom). The prediction of the DNN committee is pooled by summing up softmax




Fig. 1 Study of the influence of the loss weight $\lambda$ on averaged accuracy and
joint accuracy on the EMNIST dataset. Top: joint accuracy versus correlation. Bottom:
joint accuracy versus average accuracy


probabilities over both DNNs class-wise and then selecting the class with maximal
sum. The joint accuracy is given by the accuracy of the committee with that decision
rule. While the joint accuracy for an ordinarily trained committee with $\lambda = 0$
is about 93.5%, this can be improved by gently decorrelating the DNNs to a correlation
coefficient around 0.5, which yields an increase of almost 1 percentage point. At the
same time, there is a mild decrease in average accuracy.
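
The pooling rule can be sketched in a few lines. The function names are hypothetical and the inputs are assumed to be plain lists of softmax vectors, one per committee member.

```python
def committee_prediction(softmax_outputs: list[list[float]]) -> int:
    """Class with the maximal class-wise sum of the members' softmax probabilities."""
    n_classes = len(softmax_outputs[0])
    pooled = [sum(member[c] for member in softmax_outputs) for c in range(n_classes)]
    return max(range(n_classes), key=pooled.__getitem__)


def joint_accuracy(outputs_per_sample: list[list[list[float]]], labels: list[int]) -> float:
    """Accuracy of the committee under the summed-softmax decision rule."""
    hits = sum(
        committee_prediction(outputs) == y
        for outputs, y in zip(outputs_per_sample, labels)
    )
    return hits / len(labels)


if __name__ == "__main__":
    # Toy sample: the two members disagree, the more confident one prevails.
    sample = [[0.6, 0.4], [0.1, 0.9]]
    print(committee_prediction(sample), joint_accuracy([sample], [1]))
```

A committee member that errs with low confidence can thus be overruled by a confident correct member, which is one way in which mild decorrelation can raise the joint accuracy even though the average accuracy drops.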

Concluding this section, we again study conditional independence for 8 softmax
entropy bins (chosen according to equally distributed softmax entropy quantiles),
this time comparing independence training with $\lambda = 7.5$ to ordinary training with
$\lambda = 0$; see Table 4. We observe a considerable decrease in correlation in bin no. 8
(representing the highest softmax entropy) for the EMNIST data down to 0.07. In
comparison, the decrease for CIFAR10 is rather mild. However, even this small
correlation coefficient of 0.07 is not sufficient for assuming conditional independence.

Table 4 Correlation coefficients under independence training for different quantiles of
softmax entropy computed on EMNIST and CIFAR10

$\lambda$ | EMNIST bin 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | CIFAR10 bin 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
7.5 | 1.0 | 1.0 | 0.99 | 0.96 | 0.90 | 0.59 | 0.35 | 0.07 | 0.89 | 0.76 | 0.63 | 0.53 | 0.47 | 0.39 | 0.32 | 0.21
0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.99 | 0.95 | 0.72 | 0.44 | 1.0 | 0.99 | 0.94 | 0.80 | 0.64 | 0.57 | 0.42 | 0.33


Training several networks for independence: We now present results when training
an ensemble of $n = 5$ networks. To this end, we sum up the cross-entropy losses of
the $n = 5$ models and add the penalization term

$$\frac{\lambda}{n-1} \sum_{i=2}^{n} \sum_{j=1}^{i-1} J_{i,j}, \qquad (13)$$

which is a straightforward combinatorial generalization of the previously used
penalty term. Therein, $\lambda$ again denotes the loss weight, which varies during our
experiments.

Besides k-out-of-5 accuracies, we consider also accuracies from single networks 
and ensemble accuracies. The corresponding ensemble prediction is obtained by 
summing up the softmax probabilities (via the class-wise sum over the first k ensem- 
ble members) and then taking the argmax. 
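
The empirical k-out-of-n accuracy can be sketched as follows, assuming that a network counts as functional on a sample iff it classifies that sample correctly. The function names and the toy data are hypothetical.

```python
def k_out_of_n_accuracy(correct: list[list[bool]], k: int) -> float:
    """Fraction of samples on which at least k networks are correct.

    correct[m][i] is True iff network i classifies sample m correctly.
    """
    hits = sum(sum(row) >= k for row in correct)
    return hits / len(correct)


def single_network_accuracies(correct: list[list[bool]]) -> list[float]:
    """Accuracy of every individual network."""
    n_networks = len(correct[0])
    return [sum(row[i] for row in correct) / len(correct) for i in range(n_networks)]


if __name__ == "__main__":
    # Toy correctness matrix for 4 samples and 3 networks.
    correct = [
        [True, True, False],
        [False, False, False],
        [True, False, False],
        [True, True, True],
    ]
    print([k_out_of_n_accuracy(correct, k) for k in (1, 2, 3)])
    print(single_network_accuracies(correct))
```

By construction, the k-out-of-n accuracy is non-increasing in $k$, which mirrors the trade-off between false negatives and false positives discussed in Sect. 4.1.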

Fig. 2 Experiments with the EMNIST dataset. Left: mean k-out-of-5 accuracy (averaged
over 30 repetitions) for an ensemble of five networks as a function of the loss weight
$\lambda$. The solid lines depict the accuracies of the ensembles trained for independence
with loss weight $\lambda$, the dashed lines depict baseline ensemble accuracies trained
without incorporating the loss from (13). Right: k-out-of-5 accuracy for a single run as
a function of the mean correlation (of each network with each other network)

Table 5 Correlation coefficients for an ensemble of 5 networks trained on EMNIST for
independence and a baseline ensemble where each network was trained individually. All
results are averaged over 30 runs

Loss weight $\lambda$ | 0 | $10^{-1}$ | $10^{0}$ | $10^{1}$ | $10^{2}$ | Baseline | Theoretical model
Mean correlation | 0.70 | 0.69 | 0.65 | 0.53 | 0.43 | 0.67 | 0.00
Single network accuracy, k = 1 | 0.92 | 0.92 | 0.91 | 0.88 | 0.73 | 0.91 |
Single network accuracy, k = 2 | 0.92 | 0.92 | 0.91 | 0.91 | 0.91 | 0.91 |
Single network accuracy, k = 3 | 0.92 | 0.92 | 0.91 | 0.88 | 0.60 | 0.91 |
Single network accuracy, k = 4 | 0.92 | 0.92 | 0.91 | 0.87 | 0.51 | 0.91 |
Single network accuracy, k = 5 | 0.92 | 0.92 | 0.91 | 0.85 | 0.39 | 0.91 |
Ensemble accuracy/mean k-out-of-5 accuracy, k = 1 | 0.92/0.96 | 0.92/0.96 | 0.91/0.96 | 0.88/0.97 | 0.73/0.93 | 0.91/0.96 | 1.00
Ensemble accuracy/mean k-out-of-5 accuracy, k = 2 | 0.92/0.94 | 0.92/0.94 | 0.92/0.94 | 0.92/0.93 | 0.91/0.78 | 0.92/0.94 | 0.99
Ensemble accuracy/mean k-out-of-5 accuracy, k = 3 | 0.92/0.92 | 0.92/0.92 | 0.92/0.92 | 0.92/0.90 | 0.90/0.64 | 0.92/0.92 | 0.96
Ensemble accuracy/mean k-out-of-5 accuracy, k = 4 | 0.93/0.90 | 0.93/0.90 | 0.92/0.89 | 0.92/0.85 | 0.90/0.49 | 0.92/0.90 | 0.72
Ensemble accuracy/mean k-out-of-5 accuracy, k = 5 | 0.93/0.86 | 0.93/0.86 | 0.92/0.84 | 0.92/0.76 | 0.90/0.30 | 0.92/0.85 | 0.32

Figure 2 depicts the results of our training for independence with 5 models in
terms of k-out-of-5 accuracy. When stating mean accuracy, this refers to the average
over 30 runs. For a loss weight of $\lambda = 10^{1}$, we observe in the left panel that the
1-out-of-5 accuracy of our independence training is slightly above the accuracy of
the baseline ensemble, which was trained regularly, i.e., each network was trained
independently. Note that this is different from $\lambda = 0$, where all networks are still
trained jointly with a common loss function, which is the sum of the cross-entropy
losses. The right-hand panel shows that the 1-out-of-5 accuracy peaks at a mean
correlation (the average correlation over the errors of all networks $i < j$) of 0.4.
Decorrelating the networks’ errors further toward zero-mean correlation is possible;
however, the 1-out-of-5 accuracy decreases. The k-out-of-5 accuracy for $k > 1$ suffers
even more from decorrelating the networks’ errors. Note that, in practice, 1-out-of-5,
e.g., for a pedestrian detection, might additionally suffer from overproduction
of false positives and could be impractical. That hypothesis is indeed supported by
Table 5, which shows that for larger loss weights $\lambda > 1$ the individual network
accuracies become heterogeneous. In particular, network 5 suffers from extremely low
accuracy at $\lambda = 10^{2}$, which is, however, still far away from zero-mean correlation.
For the sake of completeness, we give two examples of correlations between the
individual models since we only reported mean correlations so far; see Fig. 3.
Comparing the right-hand panel with Table 5, we see that the well-performing network
no. 2 exhibits comparatively small correlation coefficients with the other networks’
errors. Surprisingly, the worse-performing models’ errors show higher correlation
coefficients, which reveals that in that case they often err jointly on the same input
examples. Additionally, we provide theoretical k-out-of-5 accuracies according to
(10), where we choose $p_{\mathrm{sub}}$ equal to 1 minus the average ensemble accuracy
(which is 92.45%). In particular, the k-out-of-5 accuracies of the ensemble for $k = 1$
and 2 are clearly below the theoretical ones.

Fig. 3 Correlation coefficients of the networks’ errors for all combinations of 2 out of
5 models. Left: loss weight $\lambda = 10^{-1}$. Right: loss weight $\lambda = 10^{2}$

We conclude that independence training may slightly help to improve the performance;
however, the benefit seems to be limited when all networks are trained with
the same input. Besides these tests, we considered an additional loss function that
explicitly penalizes the correlation of the networks’ errors. That approach, being
actually more directed toward our goal, even achieved negative correlation coefficients
of the networks’ errors. It was also able to slightly improve the ensembles’ performance
in terms of k-out-of-n accuracy over the baseline; however, this improvement was
even less pronounced than the one reported in this section. Thus, we do not report
those results here.


Training of independence for different input sensors: In order to conduct further
experiments on the scale of the EMNIST and CIFAR10 datasets, we consider the
3D-MNIST dataset that provides synthesized 3D representations of handwritten digits.
We apply rotations around the x-axis with chosen angles. To this end, we create
5 rotated variants of the dataset with randomly chosen but fixed angles
$\theta = a\pi/9$, $a = 1, \ldots, 9$; see Fig. 4 for an illustrative example. Each of our
$k = 1, \ldots, 5$ networks obtains one of the 5 rotated variants of the data with a given
angle $\theta_k$ and is then trained. For the baseline, all networks are again trained
independently. For independence training, all networks are trained with common loss
functions, as described previously, and the same handwritten digit, however, viewed
from a different angle $\theta_k$, is presented to network $k$, $k = 1, \ldots, 5$.
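
The construction of a rotated view can be sketched as follows. This is an illustration under several assumptions that the text does not fix: points are given as coordinate triples in the range $[-1, 1]$, the 2D projection is a 16 × 16 histogram over the x- and z-coordinates, and normalization divides by the maximal bin count. The function names are hypothetical.

```python
from math import cos, pi, sin

Point = tuple[float, float, float]


def rotate_about_x(points: list[Point], theta: float) -> list[Point]:
    """Rotate a point cloud by the angle theta around the x-axis."""
    c, s = cos(theta), sin(theta)
    return [(x, c * y - s * z, s * y + c * z) for x, y, z in points]


def project_along_y(
    points: list[Point], grid: int = 16, lo: float = -1.0, hi: float = 1.0
) -> list[list[float]]:
    """Sum points along the y-axis into a grid x grid image and normalize it."""
    image = [[0.0] * grid for _ in range(grid)]
    for x, _, z in points:
        col = min(grid - 1, max(0, int((x - lo) / (hi - lo) * grid)))
        row = min(grid - 1, max(0, int((z - lo) / (hi - lo) * grid)))
        image[row][col] += 1.0
    peak = max(max(r) for r in image)
    return [[v / peak for v in r] for r in image] if peak > 0 else image


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.9, 0.0, 0.9)]  # toy point cloud
    a = 3  # theta = a * pi / 9 = pi / 3, the angle shown in Fig. 4
    view = project_along_y(rotate_about_x(cloud, a * pi / 9))
    print(sum(v > 0 for row in view for v in row))
```

Each of the five networks would receive the projection for its own angle, so that the same digit is seen from five different directions.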

Figure 5 shows results of numerical experiments conducted analogously to those
presented in Fig. 2. Comparing both figures with each other, we observe that the
baseline k-out-of-5 accuracies are much lower than when presenting identical images
to the networks. Indeed, although MNIST classification constitutes a much simpler
task than classifying the letters from EMNIST, this 3D variant shows much lower
individual network performances, not only indicated by the left-hand panel of Fig. 5,
but also by Table 6. On the other hand, the highest mean correlation (obtained for
the loss weight $\lambda = 0$) for 3D-MNIST depicted by the right-hand panel of Fig. 5 is
below 0.3 and therefore much lower than the one depicted by Fig. 2.

Fig. 4 A 3D-MNIST example data point. Left: original input data. Center: input data
rotated by $\pi/3$ along the x-axis. Right: 2D projection of the rotated data, which is
obtained by summation and normalization along the y-axis

Fig. 5 Experiments with the 3D-MNIST dataset. Left: mean k-out-of-5 accuracy (averaged
over 30 repetitions) for an ensemble of five networks as a function of the loss weight
$\lambda$. The solid lines depict the accuracies of the ensembles trained for independence
with loss weight $\lambda$, the dashed lines depict baseline ensemble accuracies trained
without incorporating the loss from (13). Right: k-out-of-5 accuracy for a single run as
a function of the mean correlation (of each network with each other network)

Neglecting the reduced performance on 3D-MNIST, these results show that an 
ensemble, wherein each network obtains a different input, can be clearly improved 
by increasing the number of ensemble members. The networks only exhibit small 
correlations among each other. However, it seems that independence training can- 
not contribute to the performance of the ensemble anymore in the given setup. It 
remains open, whether independence training may help in a realistic setting with 
different sensors for perception in automated driving and networks conducting way 
more difficult tasks such as object detection, instance segmentation, semantic seg- 
mentation, or panoptic segmentation. Note that the correlation strengths reported in 
our experiments in accordance with Sect. 3 do not suffice to substantially reduce the 
data requirements into a feasible regime. Recalling the discussion in Sect. 3.2, even
if the subsystem performance remained unaffected by independence training, based
on the lowest correlation observed in our experiments, we could at most expect a
reduction of the required data amount by one order of magnitude.

Table 6 Results for an ensemble of 5 networks trained on 3D-MNIST for independence and a
baseline ensemble where each network was trained individually. All results are averaged
over 30 runs

Loss weight $\lambda$ | 0 | $10^{-1}$ | $10^{0}$ | $10^{1}$ | $10^{2}$ | Baseline | Theoretical model
Mean correlation | 0.25 | 0.25 | 0.24 | 0.22 | 0.15 | 0.24 | 0.00
Single network accuracy, k = 1 | 0.70 | 0.70 | 0.70 | 0.66 | 0.38 | 0.70 |
Single network accuracy, k = 2 | 0.74 | 0.73 | 0.74 | 0.73 | 0.75 | 0.75 |
Single network accuracy, k = 3 | 0.73 | 0.72 | 0.72 | 0.67 | 0.25 | 0.74 |
Single network accuracy, k = 4 | 0.81 | 0.80 | 0.79 | 0.73 | 0.17 | 0.82 |
Single network accuracy, k = 5 | 0.72 | 0.73 | 0.71 | 0.64 | 0.20 | 0.74 |
Ensemble accuracy/mean k-out-of-5 accuracy, k = 1 | 0.70/0.97 | 0.70/0.97 | 0.70/0.97 | 0.66/0.96 | 0.38/0.88 | 0.70/0.97 | 1.00
Ensemble accuracy/mean k-out-of-5 accuracy, k = 2 | 0.82/0.91 | 0.82/0.91 | 0.82/0.91 | 0.81/0.89 | 0.77/0.52 | 0.82/0.92 | 0.98
Ensemble accuracy/mean k-out-of-5 accuracy, k = 3 | 0.84/0.81 | 0.84/0.80 | 0.84/0.80 | 0.83/0.75 | 0.78/0.22 | 0.85/0.82 | 0.90
Ensemble accuracy/mean k-out-of-5 accuracy, k = 4 | 0.87/0.63 | 0.86/0.63 | 0.86/0.62 | 0.85/0.55 | 0.79/0.10 | 0.87/0.65 | 0.63
Ensemble accuracy/mean k-out-of-5 accuracy, k = 5 | 0.88/0.38 | 0.88/0.37 | 0.88/0.37 | 0.86/0.29 | 0.79/0.03 | 0.89/0.39 | 0.24


5 Conclusion and Outlook 


Summary and main take-away messages: In this chapter, we argued that obtaining 
statistical evidence from brute-force testing leads to the requirement of infeasible 
amounts of data. In Sects.2 and 3, we estimated upper and lower bounds on the 
amount of data required to test a given perception system with statistical validity 
for being significantly safer than a human driver. We restricted our considerations 
to fatalities and arrived at data amounts that are infeasible from both storage and 
labeling cost perspectives, as already found in [KP16]. In Sect. 3.1, we showed that
perfectly uncorrelated redundant AI perception systems could be used to resolve 
the problem. However, as we have seen in Sect. 3.2, redundant subsystems that are 
not perfectly uncorrelated require extremely low correlation coefficients between 
error events produced by neural networks to yield substantial reductions in data 
requirement. In Sect. 3.3, we furthermore found the amount of data needed to prove
a sufficiently low correlation to be as large as the amount of data needed for the
original test.

In Sect.4, we present numerical results on the correlation of redundant subsys- 
tems. We studied correlation coefficients between error events of redundant neu- 
ral networks dependent on network architecture, training data, and weight initializ- 
ers. Furthermore, we trained ensembles of neural networks to become decorrelated. 
Besides studying correlation coefficients, we considered the system’s performance in 
terms of 1-out-of-n as well as k-out-of-n performance. In the numerical experiments, 
we obtained correlation coefficients that would allow for a reduction in the amount of 
data required by at most one order of magnitude. For the testing problem, this would 
still represent an infeasible data requirement. However, redundancy could contribute 
to a moderate reduction of the amount of data required for testing and potentially be 
combined with other approaches. 

There are alternative approaches to test the safety of perception systems other 
than brute-force testing. Here, we give a short outlook on two of them that are very 
actively developed in the research community. 


Outlook on testing with synthetic data: One possibility to obtain vast amounts 
of data for testing is to consider synthetic sources of data. The question of whether 
synthetic data can be used for testing has already been addressed, e.g., in [RBK+21]. 
In principle, arbitrary amounts of test data can be created from a driving simulation 
such as CARLA [DRC+17]. The domain shift between synthetic and real data can be 
bridged with generative adversarial networks (GANs). The latter have been proven 
to be learnable in the large sample limit [BCST20, AGLR21], meaning that for 
increasing amounts of data and capacity of the generator and discriminator, the 
learned map from synthetic to real data is supposed to converge to the true one. 
Combining synthetic data and adversarial learning is therefore a promising candidate 
for testing DNNs. However, in this setup there remain other gaps to be bridged 
(number of assets, environmental variability, and infeasibility of the empirically 
risk-minimizing generator). 


Outlook on testing with real data: There exist a number of helpful approaches
to estimating the performance on unlabeled data. A DNN of strong performance (or
ensembles of those) can be utilized to compute pseudo ground truth. The model 
to be equipped with an AI perception system can be learned in a teacher-student 
fashion [AHKS19, BHSFs19], and the discrepancies between the student model 
and the teachers can be compared with errors on a moderate subset with ground 
truth, for instance, in terms of correlations of errors and discrepancies. Furthermore, 
in order to process the vast amounts of recorded data and perform testing more 
efficiently, well-performing uncertainty quantification methods [KG17, KRPM+18, 
RCH+20] in combination with corner case detection methods [BBLFs19, HBR+21] 
can help to pre-select data for testing. Besides that, many additional approaches 
toward improving the reliability of DNN exist. However, while a big number of tools 
already exist, their proper application to DNN testing and inference of statistically 
relevant statements on the system’s safety still require thorough research.
These approaches and other upcoming research might be part of the solution to
the testing problem in the future.


Concluding remark: As a concluding remark, this chapter does not intend to dis- 
courage safety arguments, as also conceptualized in this volume. We do not deny 
the value of empirical evidence in order to, e.g., prove the relative superiority of 
one AI system over another with regards to safety. The inherent difficulty to provide 
direct evidence for the better-than-human safety of automated driving as required by 
the ethics committee of the German Ministry of Transportation and Digital Infras- 
tructure [FBB+17] should not be mistaken as an excuse for a purely experimental 
approach. Bringing automated vehicles to the street without prior risk assessment 
implies that risks would be judged a posteriori based on the experience with a large 
fleet of automated vehicles. 

Such matters attain urgency in the light of recent German legislation on the
experimental usage of automated driving under human supervision,⁵ which only refers to
the technical equipment for the sensing of automated vehicles, but does not specify the
minimal performance for the AI-based perception based on the sensor information.
Related regulations in other countries face similar problems.⁶

The debate on how to ensure a safe transition to automated driving that complies 
with high ethical standards, therefore, remains of imminent scientific and public 
interest. 


Acknowledgements The authors thank Lina Haidar for support and numerical results in Section 
4.2. Financial support by the German Federal Ministry of Economic Affairs and Energy (BMWi) 
via Grant No. 19A19005R as a part of the Safe AI for Automated Driving consortium is gratefully 
acknowledged. 


5 https://dserver.bundestag.de/btd/19/274/1927439.pdf.

6 Framework for Automated Driving System Safety, No. NHTSA-2020-0106, 49 CFR Part 571
(Nov. 19, 2020).


References


[AGLR21] H. Asatryan, H. Gottschalk, M. Lippert, M. Rottmann, A convenient infinite
dimensional framework for generative adversarial learning, pp. 1-29 (2021).
arXiv:2011.12087

[AHKS19] S. Abbasi, M. Hajabdollahi, N. Karimi, S. Samavi, Modeling teacher-student
techniques in deep neural networks for knowledge distillation, pp. 1-6 (2019).
arXiv:1912.13179

[BBLFs19] J.-A. Bolte, A. Bär, D. Lipinski, T. Fingscheidt, Towards corner case detection for
autonomous driving, in Proceedings of the IEEE Intelligent Vehicles Symposium (IV),
pp. 438-445, Paris, France (2019)

[BCST20] G. Biau, B. Cadre, M. Sangnier, U. Tanielian, Some theoretical properties of GANs.
Ann. Stat. 48(3), 1539-1566 (2020)

[BHSFs19] A. Bär, F. Hüger, P. Schlicht, T. Fingscheidt, On the robustness of redundant teacher-
student frameworks for semantic segmentation, in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp.
1380-1388, Long Beach, CA, USA (2019)

[BTLFs20] J. Breitenstein, J.-A. Termöhlen, D. Lipinski, T. Fingscheidt, Systematization of
corner cases for visual perception in automated driving, in Proceedings of the IEEE
Intelligent Vehicles Symposium (IV), pp. 1257-1264, Virtual Conference (2020)

[BTLFs21] J. Breitenstein, J.-A. Termöhlen, D. Lipinski, T. Fingscheidt, Corner cases for visual
perception in automated driving: some guidance on detection approaches, pp. 1-8
(2021). arXiv:2102.05897

[C+15] F. Chollet, et al. Keras (2015). Accessed 18 Nov 2021

[CATvS17] G. Cohen, S. Afshar, J. Tapson, A. van Schaik, EMNIST: an extension of MNIST to
handwritten letters, pp. 1-10 (2017). arXiv:1702.05373

[DRC+17] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, V. Koltun, CARLA: an open urban
driving simulator, in Proceedings of the Conference on Robot Learning (CoRL), pp.
1-16, Mountain View, CA, USA (2017)

[FBB+17] U. di Fabio, M. Broy, R.J. Brüngger, U. Eichhorn, A. Grunwald, D. Heckmann, E.
Hilgendorf, H. Kagermann, A. Losinger, M. Lutz-Bachmann, A. Markl, K. Müller, K.
Nehm, Ethik-Kommission - Automatisiertes und vernetztes Fahren. Technical report,
Federal Ministry of Transport and Digital Infrastructure (2017)

[Ger20] German Federal Ministry for Transportation & Infrastructure. Verkehr in Zahlen
2020/2021. Deutsches Zentrum für Luft- und Raumfahrt (2020)

[HBR+21] F. Heidecker, J. Breitenstein, K. Rösch, J. Löhdefink, M. Bieshaar, C. Stiller, T.
Fingscheidt, B. Sick, An application-driven conceptualization of corner cases for
perception in highly automated driving, pp. 1-8 (2021). arXiv:2103.03678

[HS16] J. Hedderich, L. Sachs, Angewandte Statistik (in German) (Springer, Berlin, 2016)

[KB15] D.P. Kingma, J. Ba, ADAM: a method for stochastic optimization, in Proceedings
of the International Conference on Learning Representations (ICLR), pp. 1-15, San
Diego, CA, USA (2015)

[KG17] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for
computer vision? in Proceedings of the Conference on Neural Information Processing
Systems (NIPS/NeurIPS), pp. 5574-5584, Long Beach, CA, USA (2017)

[KP16] N. Kalra, S.M. Paddock, Driving to safety: how many miles of driving would it take
to demonstrate autonomous vehicle reliability? Transp. Res. Part A: Policy Pract. 94,
182-193 (2016)

[Kri09] A. Krizhevsky, Object classification experiments. Technical Report, Canadian
Institute for Advanced Research (2009)

[KRPM+18] S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J.R. Ledsam, K. Maier-Hein,
S.M. Ali Eslami, D. Jimenez Rezende, O. Ronneberger, A probabilistic U-Net for
segmentation of ambiguous images, in Proceedings of the Conference on Neural
Information Processing Systems (NIPS/NeurIPS), pp. 6965-6975, Montréal, QC,
Canada (2018)

[LCB98] Y. LeCun, C. Cortes, C.J.C. Burges, The MNIST database of handwritten digits
(1998). Accessed 18 Nov 2021

[LL88] G.R. Loftus, E.F. Loftus, Essence of Statistics (Brooks/Cole Publishing Co, Salt Lake
City, 1988)

[LWC+19] L. Liu, W. Wei, K.-H. Chow, M. Loper, E. Gursoy, S. Truex, Y. Wu, Deep neural
network ensembles against deception: ensemble diversity, accuracy and robustness,
in Proceedings of the IEEE International Conference on Mobile Ad Hoc and Sensor
Systems (MASS), pp. 274-282, Monterey, CA, USA (2019)

[ME14] W.Q. Meeker, L.A. Escobar, Statistical Methods for Reliability Data (Wiley, New
York, 2014)

[Nat20] National Transportation Safety Board NTSB. Collision Between a Sport Utility
Vehicle Operating With Partial Driving Automation and a Crash Attenuator (2020).
Accessed 18 Nov 2021




[RBK+21] J. Rosenzweig, E. Brito, H.-U. Kobialka, M. Akila, N.M. Schmidt, P. Schlicht, J.D.
Schneider, F. Hüger, M. Rottmann, S. Houben, T. Wirtz, Validation of simulation-
based testing: bypassing domain shift with label-to-image synthesis, pp. 1-8 (2021).
arXiv:2106.05549

[RCH+20] M. Rottmann, P. Colling, T.-P. Hack, R. Chan, F. Hüger, P. Schlicht, H. Gottschalk,
Prediction error meta classification in semantic segmentation: detection via aggre-
gated dispersion measures of softmax probabilities, in Proceedings of the Interna-
tional Joint Conference on Neural Networks (IJCNN), pp. 1-9, Virtual Conference
(2020)

[Sch19] F.A. Schmidt, Crowdproduktion von Trainingsdaten: Zur Rolle von Online-Arbeit 
beim Trainieren autonomer Fahrzeuge, vol. 417 of Study (in German) (Hans-Böckler- 
Stiftung, 2019) 

[WLX+20] Y. Wu, L. Liu, Z. Xie, J. Bae, K.-H. Chow, W. Wei, Promoting high diversity ensemble 
learning with ensemblebench, in Proceedings of the IEEE International Conference 
on Cognitive Machine Intelligence (CogMI), pp. 208-217, Atlanta, GA, USA (2020) 

[Zac92] S. Zacks, Introduction to Reliability Analysis: Probability Models and Statistical 
Methods (Springer, Berlin, 1992) 


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Analysis and Comparison of Datasets by


Abstract Automated driving is widely seen as one of the areas where key innova-
tions are driven by the application of deep learning. The development of safe and
robust deep neural network (DNN) functions requires new validation methods. A core 
insufficiency of DNNs is the lack of generalization for out-of-distribution datasets. 
One path to overcome this insufficiency is through the analysis and comparison of 
the domains of training and test datasets. This is important because otherwise, deep 
learning cannot advance automated driving. Variational autoencoders (VAEs) are 
able to extract meaningful encodings from datasets in their latent space. This chapter 
examines various methods based on these encodings and presents a broad evalua- 
tion on different automotive datasets and potential domain shifts, such as weather 
changes or new locations. The methods used are based on the distance to the nearest
neighbors between datasets and leverage several network architectures and metrics. 
Several experiments with different domain shifts on different datasets are conducted 
and compared with a reconstruction-based method. The results show that the pre- 
sented methods can be a promising alternative to the reconstruction error for detecting 
automotive-relevant domain shifts between different datasets. It is also shown that 
VAE loss variants that focus on a disentangled latent space can improve the stability 
of the domain shift detection quality. The best results were achieved with nearest neighbor
methods using VAE and JointVAE, a VAE variant with a discrete and a continuous
latent space, in combination with a metric based on the well-known z-score, and
with the NVAE, a VAE variant with optimizations regarding reconstruction quality, 
in combination with the deterministic reconstruction error. 


1 Introduction 


Artificial intelligence (AI) and deep learning are key enablers in automated driv- 
ing [RMM21]. One important task in the development of automated cars is the 
perception of the vehicle environment. While several sensors can be used to solve 
this task, the most promising approaches in industry and research rely at least partly 
on camera-based perception [RNFP19]. In the perception sub-tasks of object detec- 
tion and semantic segmentation, deep learning-based approaches already outperform 
classical methods [GLGL18, ZZXW19]. 

However, those deep learning-based approaches suffer from several insufficien-
cies. Following Sämann et al., one central insufficiency is the limited ability to generalize
from given training data [SSH20]. If the underlying distribution of the input data
differs from the distribution of the training dataset, the prediction accuracy may 
decrease. 

In the context of automated driving, a decrease in prediction accuracy can have
dramatic consequences, such as fatal accidents caused by undetected traffic participants,
as in the Uber accident [SSWS19]. One way to reduce the proba-
bility of such consequences is to optimize the datasets used. A high coverage
of all possible situations in the training data is crucial to improve the quality of
trained neural networks. However, since the input spaces (e.g., image data) are very 
high-dimensional and complex, the rule-based assessment of coverage is not feasi- 
ble [BLO+17]. 

Hence, in this work, the capability of variational autoencoders (VAEs) to extract
meaningful distributions from datasets is used to compare different datasets in
an unsupervised fashion. Since the VAE works directly with raw data, no assumptions
about relevant dataset dimensions for the comparison are necessary. Additionally, no
already trained perception function is necessary. The underlying data distributions
found by the VAE are analyzed to find similarities and novelties, using the latent 
space of the VAE. This makes it possible to estimate whether a particular dataset 
is in-domain or out-of-domain of a second dataset, thus allowing the detection of 
domain shifts between datasets. 

During early function development, the presented methods can be used to extract
datasets of manageable size while still maintaining high coverage of the relevant infor-
mation in the complete data pool. These smaller datasets lead to faster training and
validation and allow faster iterations. Moreover, the method can detect an unknown
domain shift between the training, validation, and test datasets, which could lead
to wrong conclusions about the performance of the function. In later development
stages, the dataset comparison enables a prediction of whether an already trained function
can be safely used in a different environment. Additionally, for domain shift detec-
tions with very high confidence, online execution in parallel with a perception neural
network is conceivable, acting as a warning system for possibly unknown input data.

A domain shift can be caused by a wide variety of factors, some of them potentially
unknown to the developer. This makes the application of rule-based approaches
impossible. So, the idea of the presented dataset comparison method is to learn a
multivariate distribution for all relevant aspects of one dataset and find useful metrics
to compare these distributions. The metrics used in this investigation are related to
known novelty detection techniques for VAEs.

The main contribution of this chapter is the application of various novelty detec- 
tion methods and various VAE architectures in several experiments, each testing a 
different kind of domain shift. The applied novelty detection methods comprise five 
different nearest neighbor approaches applied to the latent spaces. As a benchmark, 
the reconstruction error is also computed [VGM+20]. The applicability of these meth- 
ods is investigated on three datasets, each providing one or more inherent domain 
shifts (e.g., weather, location, or daytime). The presented broad evaluation of differ- 
ent approaches can give insights into the practical application of such approaches in 
the development of safe perception functions for automated driving. 


2 Related Works 


In this section, we provide an overview of related topics. There are various ways to
use VAEs for anomaly detection in datasets [AC15, SWXS18, VGM+20]. An et al.
compute the reconstruction probability of a given test sample by inspecting several 
samples drawn from the latent distribution of test samples [AC15]. The underlying 
assumption is that anomalous data samples have a higher variance and will therefore 
show a lower reconstruction quality. Vasilev et al. propose several methods and 
metrics to compute novelty scores using VAEs [VGM+20]. They describe 19 different 
methods for novelty calculation, including latent space-based and reconstruction- 
based approaches. Their evaluation on a magnetic resonance imaging dataset and 
on MNIST showed that a variety of metrics outperforms classical approaches such 
as nearest neighbor distance or principal component analysis (PCA) in the feature 
space. 

In the domain of skin disease detection, VAE-based novelty detection methods 
were applied by Lu et al. [LX18]. They decompose the VAE loss into the recon- 
struction term and the KL term and use these as a novelty metric. Their results show 
better performance of the reconstruction-based approaches. 

In the automotive domain, the issue of designing meaningful test datasets is 
addressed on various levels. Bach et al. address the selection of well-adjusted datasets 
for testing with a visual method for time series in recorded real-world data [BLO+17]. 
Langner et al. enhanced this approach with a method for dataset design, using the nov-
elty detection capabilities of classical autoencoders [LBR+18]. A related approach
uses methods from natural language processing (NLP) to describe sequential contexts
in automotive datasets, allowing similarity analysis and novelty detection [RSBS20].

These approaches can only indirectly find domain shifts for perception functions, 
since they are not designed to work in the feature space itself. For the detection 
of domain shifts directly in raw data, Löhdefink et al. propose the analysis of the
reconstruction error of an autoencoder to predict the quality of a semantic segmen-
tation network [LFK+20]. Sundar et al. address the issue of the diversity of auto-
motive data by training various β-VAEs using different labels such as weather or
daytime [SRR+20].

There is a lot of research on application domains, proposed metrics,
and network architectures in the area of dataset analysis with VAEs. However, most
of the related approaches restrict their analysis to datasets with only a few degrees
of freedom (e.g., [VGM+20] and [LX18]) or test their method on only one type of
domain shift (e.g., different locations). In this work, a broad spectrum of approaches
(different network architectures, different metrics) is evaluated on three datasets
including five different domain shifts.


3 Building Blocks of Our Approach 


In this work, four different VAE architectures and six different metrics for anomaly
detection are examined for their applicability to comparing automotive datasets. Five
of the metrics use a nearest neighbor approach in the latent space for the detection of
domain shifts; the sixth is the reconstruction error in the pixel space.

In this section, the VAE DNN is introduced. Afterwards, the used novelty detection 
methods are presented, and the section concludes with the explanation for dataset 
comparison and domain shift detection. 


3.1 Variational Autoencoder DNN 


The difference between the original autoencoder and the VAE is that the VAE uses 
a stochastic model. The VAE consists of a stochastic encoder, a latent space, and a 
stochastic decoder. The general structure of a VAE can be seen in Fig. 1. Simplified, 
one could say that the VAE transfers the input data into a low-dimensional space to 
then reconstruct the input data from this space. The underlying stochastic model is 
now explained. 

The probabilistic generative model of the VAE, $p_\theta(x|z)$, is a deep neural network
(DNN) with network parameters $\theta$ mapping $z$ into $x$ and is named decoder [VGM+20,
Dup18]. This decoder model learns a joint distribution of the data $x$ and a continuous
latent variable $z \in \mathbb{R}^d$ [CHK21]. In other words, the decoder describes the distribu-
tion of the decoded variable $x$ given the encoded one $z$. The inference model of the
VAE is trained to approximate the posterior distribution $q_\phi(z|x) \sim \mathcal{N}(\mu(x), \sigma(x))$
with parameters $\phi$ mapping $x$ into $z$ and is named encoder [VGM+20]. The encoder
describes the distribution of the encoded variable given the decoded one.

Fig. 1 Basic variational autoencoder structure, consisting of an encoder to transform data into the
latent space and a decoder to reconstruct images from sampled values $z$. The encoder receives an
image $x$ as input, where H is the height of the image, W is the width of the image, and C is the
number of color channels

The probabilistic generative model and the inference model then give rise to the
first loss term, since the aim is for the input data of the encoder to be reconstructed by
the decoder. The reconstruction error loss is defined as $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$
[Dup18]. In order to improve the disentanglement properties of $q_\phi(z|x)$, the VAE has
a loss term that tries to fit $q_\phi(z|x)$ to $p(z)$ [HMP+17]. This loss term
can both control the capacity of the latent information bottleneck and embody
statistical independence. This is achieved by defining $p(z)$ as an isotropic unit
Gaussian distribution $p(z) = \mathcal{N}(0, I)$ [HMP+17]. In order to train the disentangle-
ment in the bottleneck, the VAE uses the Kullback-Leibler (KL) divergence between the
approximate posterior and the prior, $D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$. Before we define the loss
term, we introduce the variable $\beta$, which serves as a weighting factor. This allows us
to define the loss function for the VAE and β-VAE as follows:


$$J(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z)). \tag{1}$$


The first term of the loss function is the reconstruction error, which is responsi-
ble for the VAE encoding the informative latent variable $z$ and allowing the input
data $x$ to be reconstructed [CHK21]. The second term regularizes the posterior dis-
tribution by adjusting the distribution of the encoded latent variables $q_\phi(z|x)$ to
the prior $p(z)$ [CHK21]. If $\beta > 1$ is selected, the loss function of the β-VAE as
shown in (1) is used, and if $\beta = 1$ is selected, (1) reduces to the standard
VAE loss [HMP+17]. The variable $\beta$ scales the regularization term for the posterior
distribution, with $\beta > 1$ forcing the β-VAE to encode more disentangled latent
variables by matching the encoded latent variables with the prior through higher pressure
on $D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ [HMP+17, CHK21].
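
As a rough illustration of how the two terms in (1) interact, the following Python sketch evaluates the objective for a single sample. It assumes a diagonal Gaussian encoder output (the lists `mu` and `sigma`), for which the KL divergence to the isotropic unit Gaussian prior has a well-known closed form. The function names and the scalar `log_likelihood` input are hypothetical and merely stand in for the decoder's reconstruction term.

```python
import math

def kl_to_unit_gaussian(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for one sample."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def beta_vae_objective(log_likelihood, mu, sigma, beta=1.0):
    """Objective of Eq. (1): reconstruction term minus beta times the KL term.

    log_likelihood is assumed to be an estimate of E_q[log p(x|z)] supplied
    by the decoder; beta = 1 gives the standard VAE, beta > 1 the beta-VAE.
    """
    return log_likelihood - beta * kl_to_unit_gaussian(mu, sigma)

# An encoder output equal to the prior incurs no KL penalty.
kl_prior = kl_to_unit_gaussian([0.0, 0.0], [1.0, 1.0])
# Shifting the mean away from zero is penalized more strongly for larger beta.
j_vae = beta_vae_objective(-10.0, [1.0, 0.0], [1.0, 1.0], beta=1.0)
j_beta = beta_vae_objective(-10.0, [1.0, 0.0], [1.0, 1.0], beta=4.0)
```

With the same encoder output, the larger weighting factor lowers the objective, which is the "higher pressure" toward the prior described above.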

The VAE and β-VAE only use continuous latent variables to model the latent
variable $z$ [CHK21]. The next VAE architecture that we use for our experiments addi-
tionally uses discrete variables to disentangle the generative factors of the observed
data [CHK21]. The JointVAE uses the discrete variable $c$ to decode the generative
factors of the observed data [Dup18, CHK21]. The loss function of the JointVAE is
an extension of the loss function of the β-VAE and results in (2).


$$J(\theta, \phi) = \mathbb{E}_{q_\phi(z,c|x)}[\log p_\theta(x|z,c)] - \gamma\,\bigl|D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z)) - C_z\bigr| - \gamma\,\bigl|D_{\mathrm{KL}}(q_\phi(c|x)\,\|\,p(c)) - C_c\bigr|. \tag{2}$$


The term $D_{\mathrm{KL}}(q_\phi(c|x)\,\|\,p(c))$ is the Kullback-Leibler divergence for the discrete
latent variables and $D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ is the Kullback-Leibler divergence for the
continuous latent variables [Dup18]. The variables $\gamma$, $C_z$, and $C_c$ are hyperparame-
ters [Dup18]. The hyperparameters $C_z$ and $C_c$ are gradually increased during train-
ing, and both control the amount of information the model can
encode [Dup18]. The hyperparameter $\gamma$ forces the KL divergences to match the capac-
ities $C_z$ and $C_c$. As with the standard VAE, $q_\phi(z, c|x)$ is the inference model which
maps $x$ into $z$ and $c$.
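
The role of the capacities and of the weighting hyperparameter in (2) can be illustrated with a short Python sketch. It assumes a linear capacity schedule, which is only one possible way to realize the gradual increase mentioned above; the function names, the schedule, and all numeric values are hypothetical.

```python
def capacity(step, total_steps, c_max):
    """Linearly increase the capacity from 0 to c_max (assumed schedule)."""
    return c_max * min(max(step / total_steps, 0.0), 1.0)

def capacity_penalty(kl, c, gamma):
    """One penalty term of Eq. (2): gamma * |KL - C|."""
    return gamma * abs(kl - c)

def joint_vae_objective(log_likelihood, kl_z, kl_c, c_z, c_c, gamma):
    """Objective of Eq. (2) from precomputed reconstruction and KL terms."""
    return (log_likelihood
            - capacity_penalty(kl_z, c_z, gamma)
            - capacity_penalty(kl_c, c_c, gamma))

# Halfway through training, each capacity has reached half of its maximum.
c_z = capacity(step=500, total_steps=1000, c_max=5.0)
c_c = capacity(step=500, total_steps=1000, c_max=1.0)
# KL terms that match their capacities exactly incur no penalty ...
j_matched = joint_vae_objective(-10.0, kl_z=2.5, kl_c=0.5,
                                c_z=c_z, c_c=c_c, gamma=30.0)
# ... while a continuous KL term that overshoots its capacity is penalized.
j_off = joint_vae_objective(-10.0, kl_z=3.5, kl_c=0.5,
                            c_z=c_z, c_c=c_c, gamma=30.0)
```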

In contrast to β-VAE and JointVAE, which focus on optimized latent space
distributions, the goal of the Nouveau VAE (NVAE) is to reconstruct the best possible
images [VK21]. Unlike the other VAEs used, the NVAE is a hierarchical VAE.
This means that the NVAE has a hierarchical encoder and decoder structure and
multiple latent spaces at different stages. Another special property is that this VAE
has a bidirectional encoder [VK21]. This means that information does not only
pass from one encoder to the next; information from a deeper encoder is also
passed back to the higher encoder. A focus during the development of the NVAE was
to design expressive neural networks for VAEs [VK21].

In this chapter, the VAE [KW14], β-VAE [HMP+17], JointVAE [Dup18], and
Nouveau VAE (NVAE) [VK21] are considered for the five experiments.


3.2 Methods for Novelty Detection 


Nearest neighbor approaches for novelty detection assume that the
distributions of new and unknown data differ from the distributions of normal
data. So, the latent space distance between a test sample distribution $p_t$ and the closest
normal sample can be used as a novelty score. Since latent spaces consist of high-
dimensional distributions, the selection of a feasible metric to compute meaningful
distances is non-trivial. According to Sect. 3.1, the normal distributions $p_n$ used are
defined by their parameters $\mu_n$ and $\sigma_n$. As the JointVAE has different latent spaces,
only the continuous latent space was used for the calculation of the nearest neighbor.

In this chapter, five metrics in the latent space and one metric in the pixel space are 
evaluated. The five metrics in the latent space can be further divided into two groups. 
Group one includes the metrics that only use the means ($\mu$) of the distribution, and
group two uses both means ($\mu$) and standard deviations ($\sigma$) for the calculation.

First, the metrics from group one, which only use means for the calculation, are
presented. The cosine distance is a widely used metric to determine the similarity of
vectors [NB10, Ye11]. Since it is designed for vectors, it cannot directly be applied
to distributions. Therefore, only the means $\mu$ of the normal distributions are used for
calculating the cosine distance. The cosine distance can be computed by


$$m_{\cos}(p_n, p_t) = \frac{\mu_n^{\top}\mu_t}{\|\mu_n\|\,\|\mu_t\|}, \tag{3}$$


where $\mu_n = (\mu_n(\delta))$ and $\mu_t = (\mu_t(\delta))$ describe the $d$-dimensional mean vectors of
two distributions $p_n$ and $p_t$, related to a normal and a test sample.

The Euclidean distance is the distance that can be measured with a ruler between
two points and is defined as


$$m_{\mathrm{eu}}(p_n, p_t) = \sqrt{\sum_{\delta=1}^{d} \bigl(\mu_n(\delta) - \mu_t(\delta)\bigr)^2}. \tag{4}$$


Here and in the following definitions, $d$ is the dimensionality of the latent space.
The Manhattan distance is inspired by the grid layout of Manhattan, New York
City, and measures the distance between two vectors as the sum of their components’
absolute differences [BR14]:


$$m_{\mathrm{man}}(p_n, p_t) = \sum_{\delta=1}^{d} \bigl|\mu_n(\delta) - \mu_t(\delta)\bigr|. \tag{5}$$
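
The three mean-based metrics and their use as a nearest neighbor novelty score can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the cosine distance is taken here as one minus the cosine similarity, which is a common convention and an assumption on our part, and the latent mean vectors are made-up values.

```python
import math

def cosine_distance(mu_n, mu_t):
    """One minus the cosine similarity of two mean vectors (assumed convention).

    Undefined for zero vectors, which are therefore avoided below.
    """
    dot = sum(a * b for a, b in zip(mu_n, mu_t))
    norm_n = math.sqrt(sum(a * a for a in mu_n))
    norm_t = math.sqrt(sum(b * b for b in mu_t))
    return 1.0 - dot / (norm_n * norm_t)

def euclidean_distance(mu_n, mu_t):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_n, mu_t)))

def manhattan_distance(mu_n, mu_t):
    """Sum of the absolute component differences."""
    return sum(abs(a - b) for a, b in zip(mu_n, mu_t))

def novelty_score(mu_t, normal_means, distance):
    """Distance from a test mean vector to its nearest normal mean vector."""
    return min(distance(mu_n, mu_t) for mu_n in normal_means)

normal_means = [[1.0, 0.0], [3.0, 4.0]]   # hypothetical latent means of normal data
test_mean = [3.0, 0.0]                    # hypothetical latent mean of a test sample
score_cos = novelty_score(test_mean, normal_means, cosine_distance)
score_eu = novelty_score(test_mean, normal_means, euclidean_distance)
score_man = novelty_score(test_mean, normal_means, manhattan_distance)
```

Note that the cosine-based score is zero here because the test vector points in the same direction as one normal vector, even though the two differ in length; the other two metrics do register this difference.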
The metrics presented so far only use the means $\mu$. Next, two metrics are intro-
duced that use both means $\mu$ and standard deviations $\sigma$ for the calculation of
the nearest neighbor. The first was defined by the authors of this chapter,
using the well-known z-score. From the test sample’s distribution, a vector
$z_t = (z_t(\delta))$ is sampled. Then the distance is computed via


$$m_{\mathrm{z}}(p_n, p_t) = \sum_{\delta=1}^{d} \left(\frac{z_t(\delta) - \mu_n(\delta)}{\sigma_n(\delta)}\right)^{2}, \tag{6}$$


where the difference between $z_t$ and the normal sample mean $\mu_n$ is element-wise
divided by its standard deviation $\sigma_n$. Since we are only interested in the absolute
distance of the distributions, we compute the squared L2 norm. The Bhattacharyya
distance, as proposed by Vasilev et al. [VGM+20], is also tested. The Bhattacharyya distance
measures the similarity of two distributions [CA79].
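
A minimal Python sketch of the two metrics that also use the standard deviations is given below. The z-score metric follows the description above (sample a vector from the test distribution, divide the difference to the normal mean element-wise by the normal standard deviation, take the squared L2 norm). For the Bhattacharyya distance, the sketch uses the textbook closed form for two Gaussians with diagonal covariance, summed over the latent dimensions; this is an assumption and not necessarily the exact variant used in [VGM+20]. All names and values are hypothetical.

```python
import math
import random

def z_score_distance(mu_n, sigma_n, z_t):
    """Squared L2 norm of the element-wise z-score of z_t under N(mu_n, sigma_n^2)."""
    return sum(((z - m) / s) ** 2 for z, m, s in zip(z_t, mu_n, sigma_n))

def sample_latent(mu_t, sigma_t, rng):
    """Draw one latent vector z_t from the test sample's diagonal Gaussian."""
    return [rng.gauss(m, s) for m, s in zip(mu_t, sigma_t)]

def bhattacharyya_distance(mu_n, sigma_n, mu_t, sigma_t):
    """Textbook closed form for two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for m1, s1, m2, s2 in zip(mu_n, sigma_n, mu_t, sigma_t):
        var_sum = s1 * s1 + s2 * s2
        total += 0.25 * (m1 - m2) ** 2 / var_sum          # mean separation term
        total += 0.5 * math.log(var_sum / (2.0 * s1 * s2))  # variance mismatch term
    return total

rng = random.Random(0)                       # fixed seed for reproducibility
z_t = sample_latent([0.5, -1.0], [0.1, 0.2], rng)
score_z = z_score_distance([0.0, 0.0], [1.0, 1.0], z_t)
# Identical distributions have a Bhattacharyya distance of zero.
score_b = bhattacharyya_distance([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

Because the z-score metric depends on a random draw, repeated evaluations of the same pair of samples give slightly different scores unless the random generator is seeded.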
To also consider a metric which does not use the latent space, the reconstruction
error in the pixel space is considered. Since the VAE is trained only with normal data,
the reconstruction quality should decrease when new data samples are processed.
Specifically, we chose the deterministic reconstruction error


$$m_{\mathrm{dre}}(x, \hat{x}) = \|x - \hat{x}\|_2^2, \tag{8}$$


as proposed by Vasilev et al. [VGM+20], since it yielded the best results in their
experiments.


3.3 Dataset Comparison Approach 


In this section, we explain how the methods described in Sect. 3.2 can be leveraged
to compare whole datasets by explaining the data pipeline, also visible in Fig. 2.
As starting points of the pipeline, a training dataset $D^{\mathrm{train}}$, a validation dataset $D^{\mathrm{val}}$,
which is in the distribution of the training dataset, and a test dataset $D^{\mathrm{dom}}$ are needed.
$D^{\mathrm{dom}}$ is the dataset for which the method tests whether it is in-domain
or out-of-domain of the respective training dataset. The presented metrics should
therefore be able to separate $D^{\mathrm{val}}$ and $D^{\mathrm{dom}}$. Dataset $D^{\mathrm{val}}$ most probably will have
single images with higher novelty scores, and $D^{\mathrm{dom}}$ can contain already known images.
Therefore, the metrics most probably cannot perfectly separate $D^{\mathrm{val}}$ and $D^{\mathrm{dom}}$. This
is solved by an analysis of the novelty metric results on a dataset level.

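The first step of the pipeline is the training of a VAE with the $D^{\mathrm{train}}$ dataset. For the
novelty detection, the decoder can be discarded after training. As shown in Fig. 2,
the trained encoder can now be used to generate the latent space representations of
$D^{\mathrm{train}}$, $D^{\mathrm{val}}$, and $D^{\mathrm{dom}}$, spanned by the vectors $\mu^{\mathrm{train}}$, $\sigma^{\mathrm{train}}$, $\mu^{\mathrm{val}}$, $\sigma^{\mathrm{val}}$, $\mu^{\mathrm{dom}}$, and $\sigma^{\mathrm{dom}}$.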
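
Once the latent representations are available, the nearest neighbor scoring of each validation and test sample against the training data can be sketched as below. The sketch assumes that the latent means are given as lists of floats (the hypothetical `train_means`, `val_means`, and `dom_means`) and uses a brute-force search with the Euclidean distance; for datasets with tens of thousands of images, a vectorized or indexed search would likely be preferable.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two latent mean vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor_scores(reference, queries, distance=euclidean):
    """For every query vector, the distance to its closest reference vector."""
    return [min(distance(r, q) for r in reference) for q in queries]

# Hypothetical 2-D latent means; real latent spaces are higher-dimensional.
train_means = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
val_means = [[0.1, 0.0], [0.9, 0.1]]          # close to the training data
dom_means = [[3.0, 3.0], [0.0, 1.2]]          # partly far away, partly familiar

val_scores = nearest_neighbor_scores(train_means, val_means)
dom_scores = nearest_neighbor_scores(train_means, dom_means)
```

The second test sample lies close to a training sample and therefore receives a low score, which mirrors the observation above that single images cannot separate the two datasets perfectly and that the scores have to be analyzed on a dataset level.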

Fig. 2 Approach for dataset comparison based on latent space-based novelty scores. After training,
the encoder is used to transform images into their latent space. In this space, different metrics can
be applied to calculate nearest neighbor distances between different datasets. Domain shifts can
then be detected by analyzing the histogram deviations

Now, the novelty scores provided in Sect. 3.2 can be applied to every latent
space representation of all data samples in $D^{\mathrm{val}}$ and $D^{\mathrm{dom}}$. Hence, for the deci-
sion whether $D^{\mathrm{dom}}$ has a large gap to $D^{\mathrm{val}}$, the novelty scores have to be com-
pared. In case of the nearest neighbor approaches, the resulting novelty scores
$m_{\mathrm{val}} = \{(\mathrm{val}_1, w_{\mathrm{val}_1}), \dots, (\mathrm{val}_n, w_{\mathrm{val}_n})\}$ and
$m_{\mathrm{dom}} = \{(\mathrm{dom}_1, w_{\mathrm{dom}_1}), \dots, (\mathrm{dom}_n, w_{\mathrm{dom}_n})\}$
are represented as histograms, where $n \in \mathbb{N}$ is the number of buckets, $\mathrm{val}_a$ and
$\mathrm{dom}_b$ are bucket representatives with $0 < a, b \le n$ and $a, b \in \mathbb{N}$, and $w_{\mathrm{val}_a}$, $w_{\mathrm{dom}_b}$
are the weights of the buckets [RTG00]. To compare the histograms, we use the
earth mover’s distance (EMD) [RTG00]. The EMD quantifies
the difference between the histograms $m_{\mathrm{val}}$ and $m_{\mathrm{dom}}$ by calculating the amount of earth
$f_{ab}$ which is to be moved from bucket $\mathrm{val}_a$ to $\mathrm{dom}_b$ [RTG00]. To calculate the dis-
tance, the EMD also needs the ground distance matrix $D = (d_{ab})$, where $d_{ab}$ is the
ground distance between bucket $\mathrm{val}_a$ and $\mathrm{dom}_b$ [RTG00]. With these terms, the EMD
can be defined as follows:


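$$\mathrm{EMD}(m_{\mathrm{val}}, m_{\mathrm{dom}}) = \frac{\sum_{a=1}^{n}\sum_{b=1}^{n} d_{ab} f_{ab}}{\sum_{a=1}^{n}\sum_{b=1}^{n} f_{ab}}. \tag{9}$$

Here, the flow $(f_{ab})$ between $m_{\mathrm{val}}$ and $m_{\mathrm{dom}}$ is the one that minimizes the
overall cost $\sum_{a=1}^{n}\sum_{b=1}^{n} d_{ab} f_{ab}$ under the following constraints [RTG00]:

$$f_{ab} \ge 0, \quad 1 \le a \le n,\ 1 \le b \le n, \tag{10}$$

$$\sum_{b=1}^{n} f_{ab} \le w_{\mathrm{val}_a}, \quad 1 \le a \le n, \tag{11}$$

$$\sum_{a=1}^{n} f_{ab} \le w_{\mathrm{dom}_b}, \quad 1 \le b \le n, \tag{12}$$

$$\sum_{a=1}^{n}\sum_{b=1}^{n} f_{ab} = \min\Bigl(\sum_{a=1}^{n} w_{\mathrm{val}_a}, \sum_{b=1}^{n} w_{\mathrm{dom}_b}\Bigr). \tag{13}$$

The histograms used are normalized on the x-axis to a range of values between 0 and
1. Additionally, the y-axis is also normalized, such that the histogram bars sum to 1.
With this normalization, comparability between different experiments is achieved,
since the results are independent of absolute metric values and dataset sizes.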
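
The normalization and the histogram comparison can be sketched in a few lines of Python. The sketch makes several assumptions that are not stated in the text: both histograms share the same equally spaced buckets, the x-axis is normalized jointly over both score sets, and the ground distance is the absolute difference of the bucket positions. Under these assumptions both histograms have total weight 1, the denominator of the EMD equals 1, and the optimal flow in one dimension reduces to the accumulated difference of the cumulative bucket weights. All names and score values are hypothetical.

```python
def normalized_histogram(scores, lo, hi, n_buckets):
    """Histogram over [lo, hi] with equal buckets whose weights sum to 1.

    Assumes hi > lo; scores are mapped linearly to the range [0, 1].
    """
    weights = [0.0] * n_buckets
    for s in scores:
        x = (s - lo) / (hi - lo)
        idx = min(int(x * n_buckets), n_buckets - 1)   # put x == 1 in the last bucket
        weights[idx] += 1.0 / len(scores)
    return weights

def emd_1d(w_val, w_dom):
    """EMD of two equal-mass histograms on the same equally spaced buckets in [0, 1]."""
    width = 1.0 / len(w_val)
    emd, cum = 0.0, 0.0
    for a, b in zip(w_val, w_dom):
        cum += a - b                 # surplus of earth that must move to the right
        emd += abs(cum) * width      # moving it one bucket costs one bucket width
    return emd

val_scores = [0.1, 0.2, 0.2, 0.3]    # hypothetical novelty scores
dom_scores = [0.7, 0.8, 0.9, 1.0]
lo = min(val_scores + dom_scores)
hi = max(val_scores + dom_scores)
h_val = normalized_histogram(val_scores, lo, hi, n_buckets=10)
h_dom = normalized_histogram(dom_scores, lo, hi, n_buckets=10)
distance = emd_1d(h_val, h_dom)
```

A distance close to 0 indicates nearly identical score distributions, whereas larger values indicate that the test dataset is shifted relative to the validation dataset.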

For the reconstruction error, the histogram approach is comparable with the dis- 
tances to the nearest neighbors. The histograms then are defined over the deterministic 
reconstruction error instead of the latent space distance. 


4 Experimental Setup 


In this section, first the selection of the datasets is explained. Afterwards, the exper-
iments are explained and the hyperparameter values are given. The section con-
cludes with the training parameters for the experiments.


4.1 Datasets


For the evaluation of our method, we selected different datasets, each with
a different focus. The first dataset is the Oxford RobotCar Dataset [MPLN17]. In
this dataset, the same route was driven multiple times over a period of 1 year,
except for a few small deviations. The special feature of this dataset is that different
weather conditions and seasons were recorded on the same route. This allows various 
domain shift analyses, such as changing weather conditions, e.g., sunny weather vs. 
snow. In the training, validation, snow test, rain test, and night test data splits there 
are 9130, 3873, 4094, 9757, and 9608 images, respectively. For all experiments, the 
test split is created to test whether the data is in-domain or out-of-domain. 

As a second dataset, the Audi autonomous driving dataset (A2D2) [GKM+20]
was used. This dataset is relatively new and was acquired in different German cities.
For our evaluation, we compared the images from the city of Gaimersheim with the
images from the city of Munich. The assumption with this dataset is that it shows
relatively little variation because it was recorded in two German cities in the state of
Bavaria. Thus, it should result in a very low dataset distance. A small difference is
nevertheless expected because Munich is larger than Gaimersheim, which makes this a
comparison between an urban and a suburban city. The dataset is divided into a training
dataset of 12550 images, a validation dataset of 3138 images, and a test dataset of
5490 images.

As a third dataset, the nuScenes dataset [CBL+20] was selected, because it was
recorded in Boston and Singapore. The interesting point in the comparison is that
there is right-hand traffic in Boston and left-hand traffic in Singapore. A total of
30013 images are used for training, 40152 images for validation, and 20512 images
for testing.

All datasets consist of a number of unconnected video sequences. To avoid similar
images in different dataset splits, the split between D^train, D^val, and D^dom was done
sequence-wise. This ensures that every scene occurs in exactly one dataset.
Only for the A2D2 dataset, the split between D^train and D^val was made inside the
Gaimersheim sequence, resulting in a few very similar images just before and after the
split. However, this should not have a big effect, since the similar images make up only
a very small percentage of the whole dataset.
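
A sequence-wise split of this kind can be sketched as follows. The sketch is only an
illustration under stated assumptions: the (sequence_id, image_name) layout, the function
name split_by_sequence, the validation fraction, and the toy data are hypothetical and do
not reflect the actual file structure of the datasets.

```python
import random
from collections import defaultdict


def split_by_sequence(frames, val_fraction=0.25, seed=0):
    """Assign whole sequences to either the training or the validation split.

    frames: list of (sequence_id, image_name) tuples (hypothetical layout).
    """
    by_sequence = defaultdict(list)
    for sequence_id, image_name in frames:
        by_sequence[sequence_id].append((sequence_id, image_name))

    sequence_ids = sorted(by_sequence)
    random.Random(seed).shuffle(sequence_ids)  # reproducible shuffle of sequences
    n_val = max(1, int(val_fraction * len(sequence_ids)))
    val_ids = set(sequence_ids[:n_val])

    train = [f for s in sequence_ids if s not in val_ids for f in by_sequence[s]]
    val = [f for s in sequence_ids if s in val_ids for f in by_sequence[s]]
    return train, val


# Toy data: 8 sequences with 5 frames each.
frames = [(f"seq{s}", f"seq{s}_frame{i}.png") for s in range(8) for i in range(5)]
train, val = split_by_sequence(frames)
train_sequences = {s for s, _ in train}
val_sequences = {s for s, _ in val}
```

Since the split is made over sequence identifiers rather than individual images,
consecutive and therefore nearly identical frames cannot end up in different splits.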


4.2 Experiments 


For the evaluation of our method, we selected three different datasets, which all have
a different focus. With these datasets, five different domain shifts were generated.
These domain shifts are listed in Table 1.
The five domain shifts were investigated with four different VAE variants.
Each variant was tested with six different metrics, leading to 120 experiments in total.

Table 1 All our datasets and experiments

Experiment   | Dataset D^train | Dataset D^val          | Dataset D^dom
Experiment 1 | Oxford          | Oxford alternate route | Snow
Experiment 2 | Oxford          | Oxford alternate route | Rain
Experiment 3 | Oxford          | Oxford alternate route | Night
Experiment 4 | Gaimersheim     | Gaimersheim (holdout)  | Munich
Experiment 5 | Boston          | Boston (holdout)       | Singapore

Experiments 1-3 were performed with the Oxford dataset. For training, images taken on
sunny days were used, and as validation data, images were selected that were taken on an
alternative route under the same weather conditions. This results in the following
experiments for the Oxford dataset. Experiment 1 tests if the latent space can be used to
distinguish sunny days from snowy days. The second experiment tests how rain can be
distinguished from sun, and the third experiment investigates how sunny days differ from
night scenes (Fig. 3).

Experiment 4 uses the A2D2 dataset. This experiment only has a very small
domain shift, since we compare data recorded in the suburban area of the small
German city of Gaimersheim with data from the urban area of Munich. Experiment 5
uses the nuScenes dataset. This experiment used images from Boston as training
data and, as in Experiment 4, used a holdout of the data as validation data. Images
from Singapore were used as a test set. Since night scenes were also recorded in
Singapore, we ensured that no night scenes are in the test dataset.

To allow comparison of the experiments and metrics, baseline experiments were
conducted in addition to the domain shift experiments. This is necessary to find out
which value the metric has for in-domain data. For this purpose, the validation dataset
D^val was divided into D^{val_0} and D^{val_1}. This makes it possible to define a
baseline for each experiment by setting D^val = D^{val_0} and D^dom = D^{val_1}.


4.3 VAE Hyperparameters 


To ensure comparability between the VAE, β-VAE, and JointVAE, we used the same
layer architecture and hyperparameters. The goal of this chapter is to give
an initial indication of whether a domain shift can be detected with simply chosen
hyperparameters. It is to be expected that better results can be achieved by optimizing
the hyperparameters on multiple datasets.

To evaluate the performance of the presented methods, we used the standard
VAE, β-VAE, JointVAE, and the NVAE. The architecture was the same for the VAE,
β-VAE, and the JointVAE. We use five hidden layers with dimensions 32, 64, 128,
256, and 512. After the hidden layers, we add three residual blocks. The residual
blocks are followed by a latent space with 100 neurons. Both β and γ were set to 10.5
and the input size of the images was set to 512 × 256. For the NVAE, we
adopted the architecture from the paper [VK21].
Fig. 3 These histograms show the results for Experiment 2 and the JointVAE with the z-score and
cosine distance. Training was done with the sunny Oxford images (oxford sun), and it was tested
whether rainy images (oxford rain) are out-of-domain. The histogram on the left clearly
shows a difference between rain and sun, whereas the histograms on the right largely overlap
and are poorly separated


4.4 Training 


The VAE models used for the domain shift analysis were implemented and trained
using PyTorch [PGC+17]. All VAE architectures were trained for 50 epochs with
a batch size of 128 images. In addition, a learning rate of 0.0001 in combination
with the Adam optimizer [KB15] was used. Figure 4 shows an original image and
its reconstruction by the different VAEs. The selected image is from the validation
dataset.


5 Experimental Results and Discussion 


In this section, the results are presented. Section 5.1 first reports the results obtained
with the nearest neighbor method, and Sect. 5.2 then discusses them.


5.1 Results 


[Fig. 4 panels: (a) original, (b) VAE, (c) β-VAE, (d) JointVAE, (e) NVAE]

Fig. 4 Panel (a) shows an image from the Oxford validation dataset D^val. Panels (b)-(e) show the
reconstructions of the VAE architectures used in this chapter. It is clear that the NVAE has the best
reconstruction quality. As expected, the reconstructions of the VAE (b), β-VAE (c), and JointVAE
(d) are worse than that of the NVAE. However, there are differences in reconstruction
quality among (b)-(d), e.g., in the reconstruction of the JointVAE, in contrast to the VAE and β-VAE,
the car at the right edge can be made out. The fact that the reconstruction is worse is not
a problem, because the focus of our method lies on the latent space

To make the results of the different architectures and the different metrics comparable,
we calculated, for each experiment and architecture, the ratio between the distance to the
out-of-domain data and the distance between in-domain data. This is done by

$$Q_{\mathrm{EMD}} = \frac{\mathrm{EMD}(D^{\mathrm{val}}, D^{\mathrm{dom}})}{\mathrm{EMD}(D^{\mathrm{val}_0}, D^{\mathrm{val}_1})} \qquad (14)$$

and leads to Q_EMD > 1 for a detected domain shift. The intuition behind this is
that the earth mover’s distance (computed on the distributions of latent space distances)
between two in-domain datasets has to be very low, whereas the latent space distances
to out-of-domain data should be higher. The results for the Oxford data can be seen
in Tables 2, 3, and 4. The NVAE achieves the best results in all three
experiments. With the NVAE, the reconstruction error works in all three experiments,
although the z-score with the NVAE works slightly better on the rain images.
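
The ratio in Eq. (14) can be illustrated with a short sketch. It assumes that all
histograms are normalized as described in Sect. 3 and share the same equally spaced bins
on [0, 1], in which case the one-dimensional earth mover’s distance reduces to the area
between the two cumulative distributions. The function names emd_1d and q_emd and the
histogram values are hypothetical and only serve as an example.

```python
from itertools import accumulate


def emd_1d(hist_p, hist_q):
    """EMD between two normalized histograms over the same equally spaced bins on [0, 1]."""
    if len(hist_p) != len(hist_q):
        raise ValueError("histograms must use the same bins")
    bin_width = 1.0 / len(hist_p)
    cdf_p = accumulate(hist_p)
    cdf_q = accumulate(hist_q)
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def q_emd(hist_val, hist_dom, hist_val0, hist_val1):
    """Ratio of the domain shift distance to the in-domain baseline distance."""
    baseline = emd_1d(hist_val0, hist_val1)
    if baseline == 0.0:
        raise ValueError("baseline distance is zero; the ratio is undefined")
    return emd_1d(hist_val, hist_dom) / baseline


# Hypothetical histograms of nearest-neighbor distances (four bins each).
hist_val = [0.5, 0.3, 0.2, 0.0]
hist_dom = [0.1, 0.2, 0.3, 0.4]
hist_val0 = [0.5, 0.3, 0.2, 0.0]
hist_val1 = [0.4, 0.4, 0.2, 0.0]

ratio = q_emd(hist_val, hist_dom, hist_val0, hist_val1)
baseline_ratio = q_emd(hist_val0, hist_val1, hist_val0, hist_val1)
```

In the baseline configuration, where the validation halves are also used in the numerator,
the ratio is exactly one by construction, which is why a value of one marks the
in-domain/out-of-domain boundary.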

The values in the A2D2 experiment are lower compared to the Oxford values. This 
was expected, as the domain shift is significantly smaller compared to the Oxford 
domain shifts. Different architectures and metrics are close to each other in this 
experiment. The VAE with the z-score has the best result. 

In the nuScenes dataset, the ratio is higher than in the A2D2 experiment. For this 
experiment, different combinations are again nearly the same, but the VAE with the 
Manhattan distance is the best. 

An important aspect of the experiment results is the stability of results, meaning 
a combination of architecture and metric should be able to correctly detect every 
kind of domain shift. For most combinations, this is not fulfilled, as can be seen
in Tables 2, 3, 4, 5, and 6. One example of this is the VAE with the Bhattacharyya
distance: this combination yields values above one in four of the five experiments
(Tables 2, 3, 4, 6), while failing to detect a domain shift for the A2D2 experiment in
Table 5.

Table 2 The result of Q_EMD (14) for D^val from Oxford alternate route and D^dom from Oxford snow

Architecture | Bhattacharyya distance | Cosine distance | Euclidean distance | Manhattan distance | Z-score | Reconstruction error
VAE          | 1.18                   | 1.49            | 0.52               | 0.47               | 1.01    | 1.32
β-VAE        | 7.29                   | 4.12            | 1.03               | 0.85               | 1.95    | 1.10
JointVAE     | 3.77                   | 0.50            | 0.37               | 0.45               | 3.00    | 1.25
NVAE         | 1.57                   | 1.29            | 1.23               | 1.33               | 0.80    | 8.92

Table 3 The result of Q_EMD (14) for D^val from Oxford alternate route and D^dom from Oxford rain

Architecture | Bhattacharyya distance | Cosine distance | Euclidean distance | Manhattan distance | Z-score | Reconstruction error
VAE          | 1.45                   | 0.77            | 1.00               | 1.06               | 1.01    | 1.22
β-VAE        | 1.88                   | 1.24            | 0.76               | 0.89               | 1.72    | 1.14
JointVAE     | 1.27                   | 0.68            | 0.97               | 1.10               | 2.52    | 1.25
NVAE         | 2.14                   | 0.86            | 6.08               | 3.80               | 5.40    | 5.00

Table 4 The result of Q_EMD (14) for D^val from Oxford alternate route and D^dom from Oxford night

Architecture | Bhattacharyya distance | Cosine distance | Euclidean distance | Manhattan distance | Z-score | Reconstruction error
VAE          | 1.60                   | 1.86            | 1.55               | 1.68               | 1.45    | 5.88
β-VAE        | 1.00                   | 2.06            | 0.69               | 0.81               | 1.76    | 6.74
JointVAE     | 2.27                   | 1.91            | 1.54               | 1.58               | 2.72    | 5.83
NVAE         | 0.00                   | 0.43            | 0.85               | 0.60               | 0.60    | 10.42

Table 5 The result of Q_EMD (14) for D^val from Gaimersheim (holdout) and D^dom from Munich

Architecture | Bhattacharyya distance | Cosine distance | Euclidean distance | Manhattan distance | Z-score | Reconstruction error
VAE          | 0.61                   | 0.41            | 0.72               | 0.80               | 1.76    | 0.65
β-VAE        | 0.54                   | 0.28            | 0.40               | 0.42               | 0.68    | 0.85
JointVAE     | 0.48                   | 0.32            | 0.55               | 0.74               | 1.31    | 0.61
NVAE         | 1.29                   | 1.70            | 1.31               | 1.69               | 1.25    | 1.57
Table 6 The result of Q_EMD (14) for D^val from Boston (holdout) and D^dom from Singapore

Architecture | Bhattacharyya distance | Cosine distance | Euclidean distance | Manhattan distance | Z-score | Reconstruction error
VAE          | 3.16                   | 2.49            | 4.39               | 4.78               | 2.50    | 1.86
β-VAE        | 1.34                   | 2.42            | 4.38               | 4.03               | 1.87    | 1.81
JointVAE     | 1.05                   | 2.19            | 4.69               | 4.52               | 2.97    | 1.79
NVAE         | 0.76                   | 2.69            | 1.43               | 4.43               | 1.07    | 1.32
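
The stability criterion, i.e., a ratio above one in every experiment, can be checked
mechanically against the values reported in Tables 2, 3, 4, 5, and 6. The following sketch
does this; the data layout and the function name stable_combinations are choices made for
this illustration, while the numbers are copied from the tables.

```python
METRICS = ["Bhattacharyya", "Cosine", "Euclidean", "Manhattan", "Z-score", "Reconstruction"]

# Q_EMD values copied from Tables 2-6; row order: snow, rain, night, A2D2, nuScenes.
RATIOS = {
    "VAE": [
        [1.18, 1.49, 0.52, 0.47, 1.01, 1.32],
        [1.45, 0.77, 1.00, 1.06, 1.01, 1.22],
        [1.60, 1.86, 1.55, 1.68, 1.45, 5.88],
        [0.61, 0.41, 0.72, 0.80, 1.76, 0.65],
        [3.16, 2.49, 4.39, 4.78, 2.50, 1.86],
    ],
    "beta-VAE": [
        [7.29, 4.12, 1.03, 0.85, 1.95, 1.10],
        [1.88, 1.24, 0.76, 0.89, 1.72, 1.14],
        [1.00, 2.06, 0.69, 0.81, 1.76, 6.74],
        [0.54, 0.28, 0.40, 0.42, 0.68, 0.85],
        [1.34, 2.42, 4.38, 4.03, 1.87, 1.81],
    ],
    "JointVAE": [
        [3.77, 0.50, 0.37, 0.45, 3.00, 1.25],
        [1.27, 0.68, 0.97, 1.10, 2.52, 1.25],
        [2.27, 1.91, 1.54, 1.58, 2.72, 5.83],
        [0.48, 0.32, 0.55, 0.74, 1.31, 0.61],
        [1.05, 2.19, 4.69, 4.52, 2.97, 1.79],
    ],
    "NVAE": [
        [1.57, 1.29, 1.23, 1.33, 0.80, 8.92],
        [2.14, 0.86, 6.08, 3.80, 5.40, 5.00],
        [0.00, 0.43, 0.85, 0.60, 0.60, 10.42],
        [1.29, 1.70, 1.31, 1.69, 1.25, 1.57],
        [0.76, 2.69, 1.43, 4.43, 1.07, 1.32],
    ],
}


def stable_combinations(ratios, metrics):
    """Return all (architecture, metric) pairs whose ratio exceeds one in every experiment."""
    stable = []
    for architecture, rows in ratios.items():
        for column, metric in enumerate(metrics):
            if all(row[column] > 1.0 for row in rows):
                stable.append((architecture, metric))
    return stable


stable = stable_combinations(RATIOS, METRICS)
```

With the tabulated values, only three of the 24 combinations satisfy the criterion: the
VAE with the z-score, the JointVAE with the z-score, and the NVAE with the reconstruction
error.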


5.2 Discussion 


Since this chapter aims to provide insights into the applicability of latent space
methods for dataset comparison, some central findings are presented here. One goal
was to use various experiments to find a combination that detects all domain shifts,
if possible, irrespective of the extent of the domain shift. To achieve this goal,
different latent space methods were used, as there are critics who say that a nearest
neighbor search does not work in high-dimensional space [AHK01]. In [AHK01], it was
argued that the Euclidean distance does not work well in high-dimensional spaces and
that the Manhattan distance achieves better results. For the NVAE, the latent space
dimension of 4096 is significantly higher than for the other VAE architectures with 100
dimensions. The NVAE has this high number of latent space dimensions because we used the
parameters from the NVAE paper [VK21]. The parameters were adopted to achieve the best
possible reconstruction without hyperparameter optimization runs. Our results show a
pattern similar to the one reported in [AHK01]: the Manhattan distance can work better
than the Euclidean distance. However, both metrics have the problem that they do not
always give reliable results and have multiple values less than one in the ratio
calculation. In addition to investigating whether the nearest neighbor method works, we
also wanted to find an architecture and metric that work stably, meaning that they are
able to correctly detect all tested domain shifts. Three combinations of architecture
and metric were found that work; they are shown graphically in Fig. 5. What these three
combinations have in common is that none of their ratios is smaller than one in any
experiment. It can be seen that the NVAE with the reconstruction error works best with
strong domain shifts that result in large visual differences between images, such as
rain, snow, and night. If the domain shifts consist of different locations, resulting in
only small dataset differences, the z-score with the VAE or JointVAE works better than
the NVAE with the reconstruction error. Due to the additional categorical latent
space of the JointVAE, the discrete latent space is built up better. This leads to the
JointVAE working more stably with the z-score than the VAE. The VAE is significantly
worse on the Oxford dataset and on nuScenes. In rain and snow in particular, the VAE
does not work well, as can be seen in Fig. 5. The ratio of 1.01 is so close to the
in-domain/out-of-domain boundary that the VAE should rather not be used for weather
domain shifts. The fact that the JointVAE works better than the VAE may be due not only
to the latent space but also to the loss, which puts more pressure on the latent spaces,
so that they are better structured for a nearest neighbor search. In Fig. 5, the β-VAE
is missing. This is due to the A2D2 dataset, where the domain shift was not detected
with any of the metrics. This could possibly be improved with a better choice of β. The
optimal parametrization of each network structure was beyond the scope of this chapter.
As the results show, at least in some experiments, the NVAE outperforms the other
approaches. However, due to the more complex architecture, one training run takes more
than four times as long as for the other architectures (see Table 7). Also, the inference
time, here meaning the time to execute the encoder part, is significantly longer, by
factors of about 4.6 to 5.3. This aspect should definitely be considered, since a
practical application of such approaches would probably involve frequent retraining and
encoding of large amounts of images.

Fig. 5 In this plot, the stability of the three combinations of VAE and metrics can be seen. These
three combinations have a value greater than one for each of the five experiments and therefore give
stable correct results. The red dashed line indicates the in-domain/out-of-domain boundary

Table 7 The training time of the different VAE architectures with the nuScenes dataset (30013
images). Training was done on an Nvidia A100 with 50 epochs. Additionally, we also report
the time needed to transform an image into the latent space

Architecture | Training time (hh:mm:ss) | Image to latent space (in seconds)
VAE          | 00:49:17                 | 0.0022
β-VAE        | 00:50:16                 | 0.0019
JointVAE     | 00:49:36                 | 0.0022
NVAE         | 03:31:45                 | 0.0101

In summary, the JointVAE with the z-score works as a fast alternative for the 
NVAE with the reconstruction error. 
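
The cost factors mentioned in the discussion follow directly from Table 7. The short
calculation below reproduces them; the helper to_seconds and the dictionary layout are
assumptions made for this example, while the numbers are taken from the table.

```python
def to_seconds(hhmmss):
    """Convert a duration given as hh:mm:ss into seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return 3600 * hours + 60 * minutes + seconds


# Values copied from Table 7.
TRAINING_TIME = {"VAE": "00:49:17", "beta-VAE": "00:50:16",
                 "JointVAE": "00:49:36", "NVAE": "03:31:45"}
ENCODING_TIME = {"VAE": 0.0022, "beta-VAE": 0.0019, "JointVAE": 0.0022, "NVAE": 0.0101}

nvae_training = to_seconds(TRAINING_TIME["NVAE"])
training_factor = {name: nvae_training / to_seconds(duration)
                   for name, duration in TRAINING_TIME.items() if name != "NVAE"}
encoding_factor = {name: ENCODING_TIME["NVAE"] / seconds
                   for name, seconds in ENCODING_TIME.items() if name != "NVAE"}

for name in training_factor:
    print(f"{name}: training x{training_factor[name]:.2f}, "
          f"encoding x{encoding_factor[name]:.2f}")
```

Training the NVAE takes a little more than four times as long as each of the other three
architectures, and encoding a single image is slower by a factor of roughly 4.6 to 5.3,
depending on the architecture it is compared with.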
6 Conclusions 


This chapter investigated whether the latent space of a VAE can be used to find 
domain shifts in the (image) context of the automotive industry. One goal was to 
find a combination of metrics and VAE architectures that work in general, i.e., for 
different datasets and domain shifts. 

The novelty of our chapter is the different types of VAE used to evaluate the shift 
of image data in different datasets and under different conditions. 

To find out which combination is the most stable and best compared to the others, 
we conducted 120 experiments. In these experiments, four different VAE architec- 
tures were trained with three different datasets and five different domain shifts were 
evaluated. The VAE architectures differ in that three of them have the same layer 
structure and only the loss changes, and one architecture focuses on reconstructing 
the images. 

Six different metrics were used to detect the domain shifts, including the recon- 
struction error and four different nearest neighbor searches. 

The most stable results were obtained with the JointVAE and the z-score, and
with the NVAE and the reconstruction error. By “stable results” we mean that
the ratio of earth mover’s distances between the histograms of the nearest neighbor
distances for in-domain and domain shift data is greater than one for all experiments.
The VAE with the z-score was also above one in all experiments, but on the Oxford
dataset, it was so minimally above one that this combination of metric and VAE
cannot be recommended without limitations. In the experiments, the combination of
JointVAE and z-score performed better for more subtle domain shifts like the change
of location, whereas the combination of NVAE and reconstruction error performed
better for visually apparent domain shifts like snow or rain.

The results of the experiments show that the JointVAE with a latent space metric
can be an alternative to the NVAE with the reconstruction error.

In its current state, the method can support function development through the
selection of the most relevant out-of-domain data. This selection can help to assemble
useful data subsets in which all relevant data is present.

Better results can be expected if the function for which domain shifts are being
searched is included in the training. If information from the function, e.g., inner
representations of a neural network, is used to build the latent space, a performance
drop of this neural network can probably be predicted better. However, this approach
would no longer be function-agnostic, since it requires an initially trained function.

As a conclusion, the presented methods can be a useful tool to compare and 
optimize datasets with only few assumptions and can therefore be one component 
in a strategy to develop functions with better generalization capabilities. Yet, for 
a complete approach, it has to be complemented with other methods such as rule- 
based or function-specific dataset optimizations to mutually compensate each other’s 
weaknesses. 
Abstract Synthetic, i.e., computer-generated imagery (CGI) data is a key component
for training and validating deep-learning-based perceptive functions due to its
ability to simulate rare cases, its avoidance of privacy issues, and its generation of
pixel-accurate ground truth data. Today, physical-based rendering (PBR) engines already
simulate a wealth of realistic optical effects but are mainly focused on the human
perception system. Perceptive functions, in contrast, require realistic images whose
sensor artifacts are modeled as closely as possible to those of the sensor with which
the training data has been recorded. This chapter proposes a way to improve the data
synthesis process by applying realistic sensor artifacts. To do this, one has to
overcome the domain distance between real-world imagery and synthetic imagery. Therefore,
we propose a measure which captures the generalization distance between two distinct
datasets on which the same model has been trained. With this measure, the data synthesis
pipeline can be improved to produce realistic sensor-simulated images which
are closer to the real-world domain. The proposed measure is based on the Wasserstein
distance (earth mover’s distance, EMD) over the performance metric mean
intersection-over-union (mIoU) on a per-image basis, comparing synthetic and real
datasets using deep neural networks (DNNs) for semantic segmentation. This measure
is subsequently used to match the characteristics of a real-world camera in the
image synthesis pipeline, which considers realistic sensor noise and lens artifacts.
Comparing the measure with the well-established Fréchet inception distance (FID)
on real and artificial datasets demonstrates the ability to interpret the generalization
distance, which is inherently asymmetric and more informative than just a simple
distance measure. Furthermore, we use the metric as an optimization criterion to adapt a
synthetic dataset to a real dataset, decreasing the EMD between a synthetic dataset
and the Cityscapes dataset from 32.67 to 27.48 and increasing the mIoU of our test
algorithm (DeeplabV3+) from 40.36% to 47.63%.
1 Introduction 


Validation of deep neural networks (DNNs) increasingly resorts to
computer-generated imagery (CGI), as it mitigates certain issues. First, synthetic
data avoids privacy issues found with recordings of members of the public. Second,
it can automatically be produced in vast amounts at high quality, with pixel-accurate
ground truth data that is more reliable than costly manually labeled data. Moreover,
simulations allow the synthesis of rare cases and the systematic variation
and explanation of critical constellations [SGH20], a requirement for the validation of
products targeting safety-critical applications, such as automated driving. Here, the
creation of corner cases and scenarios which otherwise could not be recorded in a
real-world scenario without endangering other traffic participants is the key argument
for the validation of perceptive AI with synthetic images.

Despite the advantages of CGI methods, training and validation with synthetic 
images still have challenges: Training with these images does not guarantee a similar 
performance on real-world images and validation is only valid if one can verify 
that the found weaknesses in the validation do not stem from the synthetic-to-real 
distribution shift seen in the input. 

To measure and mitigate this domain shift, metrics have been introduced with
various applications in the field of domain adaptation or transfer learning. In domain
adaptation, metrics such as FID, kernel inception distance (KID), and maximum
mean discrepancy (MMD) are applied to train generative adversarial networks
(GANs) to adapt to a target feature space [PTKY09] or to re-create the visual
properties of a dataset [SGZ+16]. However, the problem of training and validation with
synthetic imagery is directly related to the predictive performance of a perception
algorithm on the target data, and these kinds of metrics struggle to correlate with
the predictive performance [RV19]. Additionally, applications of domain adaptation
methods often resort to specifically trained DNNs, e.g., GANs, which adapt one
domain to the other and therefore add an extra layer of complexity and
uncontrollability. This is especially unwanted if a validation goal is tested, e.g., to
detect all pedestrians, and the domain adaptation by a GAN would add additional objects
to the scene (e.g., see [HTP+18]), making it even harder to attribute detected faults of
the model to certain specifics of the tested scene. Here, the creation of images via
a synthesis process makes it possible to understand the factors influencing the domain
distance more directly, as all parameters are under direct control.

Camera-recorded images inherently show visual imperfections or artifacts, such 
as sensor noise, blur, chromatic aberration, or image saturation, as can be seen in an 
image example from the A2D2 [GKM+20] dataset in Fig. 1. CGI methods, on the 
other hand, are usually based on idealized models; for example, the pinhole camera 
model [Stu14] which is free of sensor artifacts. 

Fig. 1 Real-world images (here A2D2) exhibit sensor lens artifacts which have to be closely
modeled by an image synthesis process to decrease the domain distance between synthetic and
real-world datasets and make them viable for training and validation

In this chapter, we present an approach to decrease the domain divergence of
synthetic to real-world imagery for perceptive DNNs by realistically modeling sensor
lens artifacts to increase the viability of CGI for training and validation. To achieve
this, we first introduce a model of sensor artifacts whose parameters are extracted
from a real-world dataset and then apply it to a synthetic dataset for training and
measuring the remaining domain divergence via validation. To this end, we present a new
interpretation of the domain divergence as the generalization distance between two
datasets, based on a per-image performance comparison over a dataset utilizing the
Wasserstein or earth mover’s distance (EMD). Next, we demonstrate how this model is
able to decrease the domain divergence further by optimizing the initially extracted
sensor camera simulation parameters, as depicted in Fig. 6. Additionally, we compare
our results with randomly chosen parameters as well as with randomly chosen and
optimized parameters. Last, we strengthen the case for the usability of our EMD
domain divergence measure by comparing it with the well-known Fréchet inception
distance (FID) on a set of real-world and synthetic datasets and highlight the
advantage of our asymmetric domain divergence over the symmetric distance.


2 Related Works 


This chapter is related to two areas: domain distance measures, as used in the field 
of domain adaptation and synthetic data generation for training and validation. 


Domain distance measures: A key challenge in domain adaptation approaches is 
the expression of a distance measure between datasets, also called domain shift. A 
number of methods were developed to mitigate this shift (e.g., see [LCWJ15, GL15, 
THSD17, THS+18]). 

To measure the domain shift or domain distance, the inception score (IS) has
been proposed [SGZ+16], where the classification output of an InceptionV3-based
[SVI+16] discriminator network trained on the ImageNet dataset [DDS+09]
is used. The works of [HRU+17, BSAG21] rely on features extracted from the
InceptionV3 network to tune domain adaptation approaches, i.e., the FID
and KID. However, these metrics cannot predict whether the classification performance
increases when adapted data is applied as training data for a discriminator [RV19].

Therefore, to measure performance directly, it is essential to train with the adapted 
or synthetic data and validate on the target data, i.e., cross-evaluation as done by 
[RSM+16, WU18, SAS+18]. 


Performance metrics: The mean intersection-over-union (mIoU) is a widely used
performance metric for benchmarking semantic segmentation [COR+16, VSN+18].
Adaptations and improvements of the mIoU have been proposed which put more
weight on the segmentation contour, as in [FMWR18, RTG+19]. Performance metrics
such as the mIoU are computed over the whole validation dataset, i.e., the whole
confusion matrix, but there are proposals to apply the mIoU calculation on a per-image
basis and compare the resulting empirical distributions [CLP13].
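
A per-image mIoU can be sketched as follows. The sketch assumes that each image is given
as a flat list of class indices and that classes absent from both the prediction and the
label of an image are skipped when averaging; this handling of absent classes, like the
function name per_image_miou, is an assumption of this example and may differ from the
cited work.

```python
def per_image_miou(prediction, target, num_classes):
    """Mean IoU over all classes that occur in the prediction or the label of one image."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(prediction, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(prediction, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and label
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Toy image with four pixels and three possible classes.
target = [0, 0, 1, 1]
prediction = [0, 1, 1, 1]
score = per_image_miou(prediction, target, num_classes=3)
```

Evaluating this function for every image of a dataset yields an empirical distribution of
scores instead of a single number for the whole dataset.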

A per-image comparison mitigates several shortcomings of a single evaluation
metric on the whole dataset when used for comparison of classifiers on the same
dataset. First, one can distinguish multimodal and unimodal distributions, i.e., strong
classification on one half and weak classification on the other half of a set can lead
to the same mean as an average classification on all samples. Second, unimodal
distributions with the same mean but different shapes are also indiscernible under a
single dataset-averaged metric. This justification led to our choice of a per-image
mIoU metric, as it allows for deeper investigations, which are especially helpful
when one wants to understand the characteristics that increase or decrease a domain
divergence.
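
The first point can be made concrete with a small numerical example. The two lists of
per-image scores below are invented for illustration, and the one-dimensional Wasserstein
distance is computed with the standard sorted-sample formula for two samples of equal size.

```python
def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two equally sized one-dimensional samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this simple formula requires samples of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# Hypothetical per-image mIoU scores of two classifiers on the same ten images.
bimodal = [0.2] * 5 + [0.8] * 5   # strong on one half, weak on the other half
unimodal = [0.5] * 10             # average on all images

mean_bimodal = sum(bimodal) / len(bimodal)
mean_unimodal = sum(unimodal) / len(unimodal)
distance = wasserstein_1d(bimodal, unimodal)
```

Both score lists have a mean of 0.5, so a dataset-averaged metric cannot tell them apart,
whereas the distance between the two per-image distributions is 0.3.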


Sensor simulation for synthetic image training: The use of synthesized data for
development and validation is an accepted technique and has also been suggested for
computer vision applications (e.g., [BB95]). Recently, specifically for the domain of
driving scenarios, game engines have been adapted [RHK17, DRC+17].

Although game engines provide a good starting point to simulate environments,
they usually only offer a closed rendering set-up with many trade-offs balancing
between real-time constraints and a subjectively good visual appearance for human
observers. Specifically, the lighting computation in the rendering pipelines is not
transparent. Therefore, it does not produce physically correct imagery; instead, it
provides only a fixed rendering quality (as a function of lighting computation and tone
mapping), resulting in output images with a low dynamic range (LDR), typically 8 bits
per RGB color channel.

Recently, physical-based rendering techniques have been applied to the generation
of data for training and validation, as in Synscapes [WU18]. For our chapter we use
a dataset in high dynamic range (HDR) created with the physical-based Blender
Cycles renderer.¹ We implemented a customized tone mapping to 8-bit per color
channel and sensor simulation, as described in the next section.


¹ Provided by the KI Absicherung project [KI20].
While there is great interest in understanding the domain distance in the area of
domain adaptation via generative strategies, i.e., GANs, there has been little research
regarding sensor artifact influence on training and validation with synthetic images.
Other works [dCCN+16, NdCCP18] add different kinds of sensor noise to their
training set and report a degradation of performance, compared to a model trained
with no noise in the training set, due to training of a harder, i.e., noisier, visual task.
Adding noise in training is a common technique for image augmentation and can be
seen as a regularization technique [Bis95] to prevent overfitting.

Our task of modeling sensor artifacts for synthetic images extracted from camera 
images is not aimed at improving the generalization through random noise, but to 
tune the parameters of our sensor model to closely replicate the real-world images 
and improve generalization on the target data. 

First results of modeling camera effects to improve synthetic data learning on
the perceptive task of bounding box detection have been presented by [CSVJR18,
LLFW20]. Lin et al. [LLFW20] additionally state that generalization is an asymmetric
measure which should be considered when comparing with symmetric dataset
distance measures from literature. Furthermore, Carlson et al. [CSVJR19] learned
sensor artifact parameters from a real-world dataset and applied the learned parameters
of their noise sources as image augmentation during training with synthetic data
on the task of bounding box detection. However, in contrast to our approach, they apply
their optimization as a style loss on a latent feature vector extracted from a VGG-16
network trained on ImageNet and evaluate the performance on the task of 2D object
detection.


3 Methods 


Given a synthetic (CGI) dataset of urban street scenes, our goal is to decrease the
domain gap to a real-world dataset for semantic segmentation by realistic sensor
artifact simulation. Therefore, we systematically analyze the image sensor artifacts
of the real-world dataset and use this extracted parametrization for our sensor artifact
simulation. To compare our synthetic dataset with the real-world dataset, we devise
a novel per-image performance-based metric to measure the generalization distance
between the datasets. We utilize a Deeplabv3+ [CZP+18] semantic segmentation
model with a ResNet101 [HZRS16] backbone to train and evaluate on the different
datasets throughout this chapter. To show the valuable properties of our measure, we
compare it with an established domain distance, i.e., the Fréchet inception distance
(FID). Lastly, we use our measure as an optimization criterion for adapting the
parameters of our sensor artifact simulation, with the extracted parameters as starting
point, and show that we can further decrease the domain distance from synthetic images to
real-world images.


132 K. Hagn and O. Grau 


3.1 Sensor Simulation 


We implemented a simple sensor model with the principal blocks depicted in Fig. 2. 
The module expects images in linear RGB space. Rendering engines like Blender 
Cycles² can provide these images as results in OpenEXR format.³ 

We simulate a simple model by applying chromatic aberration, blur, and sensor 
noise, as additive Gaussian noise (zero mean, variance is a free parameter), followed 
by a simple exposure control (linear tone mapping), finished by non-linear gamma 
correction. 

First, we apply blur by a simple box filter with filter size F × F and a chromatic 
aberration (CA). The CA is approximated using radial distortions (k1, second order), 
e.g., [CV 14], as defined in OpenCV. The CA is implemented as a per-channel (red, 
green, blue) variation of the k1 radial distortion, i.e., we introduce an incremental 
parameter ca that affects the radial distortions: k1(blue) = −ca; k1(green) = 0; 
k1(red) = +ca. As the next step, we apply Gaussian noise to the input image. 

Applying a linear function, the pixel values are then mapped and rounded to the 
target output byte range [0, ..., 255]. 

The two parameters of the linear mapping are determined by a histogram evalu- 
ation of the input RGB values of the respective image, imitating an auto exposure 
of a real camera. In our experiments we have set it to saturate 2% (initially) of the 
brightest pixel values, as these are usually values of very high brightness, induced 
by sky or even the sun. Values below the minimum or above the set maximum are 
mapped to 0 or 255, respectively. 

In the last step we apply gamma correction to achieve the final processed synthetic 
image: 


$$\tilde{x} = x^{\gamma} \qquad (1)$$


The parameter $\gamma$ is an approximation of the sensor's non-linear mapping function. 
For media applications this is usually $\gamma = 2.2$ for the sRGB color space [RD 14]. 
However, for industrial cameras, this is not yet standardized and some vendors do 
not reveal it.⁴ We therefore estimate the parameter as an approximation. Figure 3 
depicts the difference of an image with and without simulated sensor artifacts. 

Fig. 2 Sensor artifact simulation. Blocks shown: blur and chromatic aberration, noise 
generator, measurement of histogram range, linear mapping and rounding to [0, ..., 255], 
gamma correction 

Fig. 3 Left a: Synthetic images without lens artifacts. Right b: Applied sensor lens artifacts, 
including exposure control 

2 https://www.blender.org/. 
3 https://www.openexr.com/. 
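
The processing chain described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under several assumptions that are not stated in the text: the function and parameter names are hypothetical, the radius used for the radial distortion is normalized by the half-diagonal of the image, the lower end of the linear mapping is the image minimum, the noise standard deviation is given in the units of the linear input image, and gamma correction is taken as a power law on values normalized to [0, 1].

```python
import numpy as np


def box_blur(img, size):
    """Separable size x size box filter with edge padding; img is H x W x 3."""
    if size <= 1:
        return img
    kernel = np.ones(size) / size
    before, after = (size - 1) // 2, size // 2
    out = img
    for axis in (0, 1):
        pad = [(0, 0), (0, 0), (0, 0)]
        pad[axis] = (before, after)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(np.convolve, axis, padded, kernel, mode="valid")
    return out


def radial_shift(channel, k1):
    """Resample one channel with the radial model r -> r * (1 + k1 * r^2)."""
    h, w = channel.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    norm = np.hypot(cx, cy) or 1.0  # assumption: normalize by the half-diagonal
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x, y = (xs - cx) / norm, (ys - cy) / norm
    factor = 1.0 + k1 * (x * x + y * y)
    src_x = np.clip(np.rint(x * factor * norm + cx), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(y * factor * norm + cy), 0, h - 1).astype(int)
    return channel[src_y, src_x]  # nearest-neighbour lookup keeps the sketch short


def simulate_sensor(img, blur_size, ca, noise_std, saturation, gamma, rng=None):
    """Apply blur, chromatic aberration, noise, auto exposure and gamma to linear RGB."""
    rng = np.random.default_rng() if rng is None else rng
    out = box_blur(np.asarray(img, dtype=np.float64), blur_size)
    # Per-channel k1 as in the text: red = +ca, green = 0, blue = -ca (RGB order assumed).
    out = np.stack(
        [radial_shift(out[..., c], k1) for c, k1 in enumerate((ca, 0.0, -ca))],
        axis=-1,
    )
    out = out + rng.normal(0.0, noise_std, out.shape)  # zero-mean Gaussian noise
    # Auto exposure: saturate the brightest `saturation` percent of the values.
    lo, hi = out.min(), np.percentile(out, 100.0 - saturation)
    out = np.clip(np.rint((out - lo) * 255.0 / max(hi - lo, 1e-12)), 0, 255)
    out = 255.0 * (out / 255.0) ** gamma  # assumed power-law gamma correction
    return np.rint(out).astype(np.uint8)


# Illustrative call on a random stand-in for a rendered linear RGB image.
demo_rng = np.random.default_rng(0)
demo = simulate_sensor(demo_rng.random((32, 48, 3)), blur_size=4, ca=0.05,
                       noise_std=0.01, saturation=2.0, gamma=0.8, rng=demo_rng)
```

Nearest-neighbour resampling and a joint percentile over all three channels are simplifications chosen for brevity; an implementation based on the OpenCV distortion model would interpolate and use calibrated normalization.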


3.2 Dataset Divergence Measure 


Our proposed distance quantifies per-image performance between models trained on 
different datasets but evaluated on the same dataset. Considering the task of semantic 
segmentation, we chose the mIoU as our base metric. We then modify the mIoU to be 
calculated per image instead of from the confusion matrix of the whole evaluated dataset. 
Next, we introduce the Wasserstein-1 or earth mover’s distance (EMD) metric as 
our divergence measure between the per-image mIoU distributions of two classifiers 
trained on distinct datasets, i.e., synthetic and real-world datasets, but evaluated on 
the same real-world dataset the second classifier has been trained with. 
The mIoU is defined as follows: 


$$\mathrm{mIoU} = \frac{1}{S} \sum_{s \in \mathcal{S}} \frac{TP_s}{TP_s + FP_s + FN_s} \times 100\%, \qquad (2)$$

with $TP_s$, $FN_s$, and $FP_s$ being the number of true positives, false negatives, and 
false positives of the $s$th class over all images of the evaluated dataset. 

4 The providers of the Cityscapes dataset don’t document the exact mapping. 

Here, $\mathcal{S} = \{0, 1, ..., S - 1\}$, with S = 11, as we use the 11 classes defined in 
Table 1. These classes are the maximal overlap of common classes in the real and 
synthetic datasets considered for cross-evaluation and comparison of our measure 
with the Fréchet inception distance (FID), as can be seen later in Sect. 4.3, Tables 3 
and 4. 
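
As a small illustration of (2), the mIoU can be computed from a confusion matrix accumulated over all pixels. The sketch below uses hypothetical helper names and makes two assumptions that the text does not specify: labels outside the class range are ignored, and a class that never occurs (zero denominator) contributes an IoU of 0 to the average over all S classes.

```python
import numpy as np


def confusion_matrix(label, pred, num_classes):
    """Count pixels per (ground-truth class, predicted class) pair."""
    label = np.asarray(label).ravel().astype(np.int64)
    pred = np.asarray(pred).ravel().astype(np.int64)
    valid = (label >= 0) & (label < num_classes)  # assumption: skip ignore labels
    index = num_classes * label[valid] + pred[valid]
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def miou_percent(cm):
    """Mean IoU in percent from a confusion matrix (rows: truth, columns: prediction)."""
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = tp + fp + fn
    # Assumption: classes with an empty denominator count as IoU = 0.
    iou = np.divide(tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return 100.0 * float(iou.mean())


# Dataset-level value: sum the confusion matrices of all images first.
labels = [np.array([[0, 0], [1, 1]]), np.array([[1, 1], [1, 0]])]
preds = [np.array([[0, 1], [1, 1]]), np.array([[1, 1], [0, 0]])]
total = sum(confusion_matrix(l, p, num_classes=2) for l, p in zip(labels, preds))
dataset_miou = miou_percent(total)
```

Calling `miou_percent` on the confusion matrix of a single image instead of the accumulated one gives a per-image score, which is the quantity the divergence measure of this section is built on.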

A distribution over the per-image IoU takes the following form: 


$$\mathrm{IoU}_n = \frac{1}{S} \sum_{s \in \mathcal{S}} \frac{TP_{s,n}}{TP_{s,n} + FP_{s,n} + FN_{s,n}} \times 100\%, \qquad (3)$$


where n denotes the nth image in the validation dataset. Here, $\mathrm{IoU}_n$ is measured in 
%. We want to compare the distributions of per-image IoU values from two different 
models; therefore, we apply the Wasserstein distance. The Wasserstein distance as an 
optimal mass transport metric from [KPT+17] is defined for density distributions p 
and q as follows, where inf denotes the infimum, i.e., the lowest transportation cost, 
and $\Gamma(p, q)$ denotes the set of all joint distributions $\pi$, i.e., transportation 
maps, for (X, Y) which have the marginals p and q: 


$$W_r(p, q) = \left( \inf_{\pi \in \Gamma(p, q)} \int_{\mathbb{R} \times \mathbb{R}} |X - Y|^r \, d\pi \right)^{1/r}. \qquad (4)$$


This distance formulation is equivalent to the following [RTC17]: 


$$W_r(p, q) = \left( \int_{-\infty}^{\infty} |P(t) - Q(t)|^r \, dt \right)^{1/r}. \qquad (5)$$


Here P and Q denote the respective cumulative distribution functions (CDFs) of p 
and q. 

In our application we calculate the empirical distributions of p and q, which 
simplifies in this case to the function of the order statistics: 


$$W_r(\hat{p}, \hat{q}) = \left( \sum_{i=1}^{n} |\hat{p}_i - \hat{q}_i|^r \right)^{1/r}, \qquad (6)$$


where $\hat{p}$ and $\hat{q}$ are the empirical distributions of the marginals p and q sorted in 
ascending order. With r = 1 and equal weight distributions we get the earth mover’s 
distance (EMD) which, in other words, measures the area between the respective 
CDFs with $L_1$ as ground distance. 

We assume a sample size of at least 100 to be enough for the EMD calculation to 
be valid, as fewer samples might not guarantee a sufficient sampling of the domains. 
In our experiments we use sample sizes > 500. 
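
For two equally sized samples of per-image scores, the order-statistics form in (6) reduces to sorting both samples and comparing them element by element. The following sketch uses hypothetical function names and assumes equal weights of 1/n per image, so that the r = 1 result is the mean absolute difference of the sorted scores and stays within [0, 100] for scores given in percent.

```python
import numpy as np


def wasserstein_r(scores_a, scores_b, r=1.0):
    """Wasserstein-r distance between two equally sized 1D samples."""
    a = np.sort(np.asarray(scores_a, dtype=np.float64))
    b = np.sort(np.asarray(scores_b, dtype=np.float64))
    if a.shape != b.shape:
        raise ValueError("both models must be evaluated on the same set of images")
    # Assumption: every image carries the same weight 1/n.
    return float(np.mean(np.abs(a - b) ** r) ** (1.0 / r))


def emd(scores_a, scores_b):
    """Earth mover's distance, i.e., the r = 1 case."""
    return wasserstein_r(scores_a, scores_b, r=1.0)


# Per-image mIoU values (in percent) of two models on the same three images.
synthetic_model = [10.0, 20.0, 30.0]
real_model = [30.0, 10.0, 50.0]
distance = emd(synthetic_model, real_model)
```

Under these assumptions the r = 1 value should agree with `scipy.stats.wasserstein_distance` for equally sized samples. The function itself is symmetric in its two arguments; the asymmetry of the proposed measure comes from which dataset a model is trained on and which dataset it is evaluated on.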
Fig. 4 CDF of ensembles of DeeplabV3+ models trained on Cityscapes and evaluated on its 
validation set. Applying the 2-sample Kolmogorov-Smirnov test to each possible pair of the ensemble, 
we get a minimum p-value > 0.95 


The FID is a special case of the Wasserstein-2 distance derived from (6) with 
r = 2 and $\hat{p}$ and $\hat{q}$ being normally distributed, leading to the following definition: 


$$\mathrm{FID} = \lVert \mu_s - \mu_r \rVert^2 + \mathrm{Tr}\left( \Sigma_s + \Sigma_r - 2 \, (\Sigma_s \Sigma_r)^{1/2} \right), \qquad (7)$$


where $\mu_s$ and $\mu_r$ are the means, and $\Sigma_s$ and $\Sigma_r$ are the covariance matrices of the 
multivariate Gaussian-distributed feature vectors of synthetic and real-world datasets, 
respectively. 
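
Equation (7) can be evaluated directly from the two sets of Gaussian statistics. The sketch below is a minimal NumPy version with hypothetical function names; it assumes the feature vectors have already been extracted (random arrays stand in for network features here) and obtains the trace of the matrix square root from the eigenvalues of the covariance product, which are real and non-negative for positive semi-definite covariances.

```python
import numpy as np


def gaussian_stats(features):
    """Mean vector and covariance matrix of an (N x D) feature array."""
    features = np.asarray(features, dtype=np.float64)
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu_s, sigma_s, mu_r, sigma_r):
    """Frechet distance between two multivariate Gaussians, following (7)."""
    diff = np.asarray(mu_s, dtype=np.float64) - np.asarray(mu_r, dtype=np.float64)
    eigvals = np.linalg.eigvals(np.asarray(sigma_s) @ np.asarray(sigma_r))
    # Tr((S_s S_r)^(1/2)) equals the sum of the square roots of the eigenvalues.
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_s) + np.trace(sigma_r) - 2.0 * trace_sqrt)


# Stand-in features; in practice these would be network activations per image.
rng = np.random.default_rng(0)
synthetic_features = rng.normal(0.0, 1.0, size=(500, 8))
real_features = rng.normal(0.5, 1.5, size=(500, 8))
fid = frechet_distance(*gaussian_stats(synthetic_features), *gaussian_stats(real_features))
```

Swapping the two argument pairs leaves the value unchanged, which is the symmetry referred to in the comparison with the proposed divergence.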

Compared to distance metrics such as the FID which by definition is symmetric, 
our measure is a divergence, i.e., the distance from dataset A to dataset B can be 
different to the distance from dataset B to dataset A. Being a divergence also reflects 
the characteristic of a classifier having different generalization distance when trained 
on dataset A and evaluated on dataset B or the other way around. 

Because the ground measure of the signatures, i.e., the IoU per image, is bounded 
to $0 \le \mathrm{IoU}_n \le 100$, the EMD measure is then bounded to $0 \le \mathrm{EMD} \le 100^r$ with r 
being the Wasserstein norm. For r = 1, the measure is bounded with $0 \le \mathrm{EMD} \le 100$. 

To verify whether the per-image IoU of a dataset is a good proxy of a dataset’s 
domain distribution, we need to verify that the distribution stays (nearly) constant 
when training from different starting conditions. Therefore, we trained six models 
of the DeeplabV3+ network with the same hyperparameters but different random 
initialization on the Cityscapes dataset and evaluated them on the validation set 
calculating the mIoU per image. The resulting distributions of each model in the 
ensemble are converted into a CDF as is shown in Fig. 4. To have stronger empirical 
evidence of the per-image mIoU performance distribution being constant for a 
dataset, we apply the two-sample Kolmogorov-Smirnov test on each pair of distributions 
in the ensemble. The resulting p-values are all above 0.95, hence supporting 
our hypothesis. 
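
The pairwise check can be sketched with `scipy.stats.ks_2samp`. The helper name and the synthetic score arrays below are hypothetical stand-ins; in the setting of this section each array would hold the per-image mIoU values of one ensemble member on the validation set.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ks_2samp


def min_pairwise_ks_pvalue(score_sets):
    """Smallest two-sample KS p-value over all pairs of per-image score arrays."""
    return min(ks_2samp(a, b).pvalue for a, b in combinations(score_sets, 2))


# Six stand-in "models": the same underlying scores with tiny perturbations.
rng = np.random.default_rng(0)
base = rng.normal(65.0, 10.0, size=500)
ensemble = [base + rng.normal(0.0, 0.01, size=500) for _ in range(6)]
p_min = min_pairwise_ks_pvalue(ensemble)
```

With six models this covers 15 pairs. A high minimum p-value means the test finds no evidence that any two distributions differ; it does not by itself prove that they are identical.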


3.3 Datasets 


For our sensor parameter optimization experiments we consider two datasets. First, 
the real-world Cityscapes dataset, which consists of 2,975 annotated images for 
training and 500 annotated images for validation. All images were captured in urban 
street scenes in German cities. Second, the synthetic dataset provided by the KI-A 
project [KI 20]. This dataset consists of 21,802 annotated training images and 
5,164 validation images. The KI-A synthetic dataset comprises urban street scenes, 
similar to Cityscapes, and suburban to rural street scenes which are characterized by 
less traffic and less dense house placements, therefore more vegetation and terrain 
objects. 


4 Results and Discussion 


4.1 Sensor Parameter Extraction 


As a baseline for our sensor simulation, we analyzed images from the Cityscapes 
training data and measured the parameters. Sensor noise was extracted from about 
10 images with uniformly colored areas ranging from dark to light colors. Chromatic 
aberration was extracted from 10 images with traffic signs on the outermost edges of 
the image, as can be seen in Fig. 5. The extracted values have been averaged over the 
number of images. The starting parameters of our optimization approach are then as 
follows: saturation = 2.0%, noise $\sim \mathcal{N}(0, 3)$, $\gamma = 0.8$, F = 4, and ca = 0.08. 


4.2 Sensor Artifact Optimization Experiment 


Utilizing the EMD as dataset divergence measure and the extracted sensor parame- 
ters from camera images of Cityscapes, we apply an optimization strategy to itera- 
tively decrease the gap between the Cityscapes and the synthetic dataset [KI 20]. For 
optimization, we chose to use the trust region reflective (trf) method [SLA+15] as 
implemented in SciPy [VGO+20]. The trf is a least-squares minimization method 
to find the local minimum of a cost function given certain input variables. The cost 
function is the EMD from synthetic model and real-world model predictions on the 
same real-world validation dataset. The variables as input to the cost function are 
the parameters of the sensor artifact simulation. The trf method has the capability 
of bounding the variables to meaningful ranges. The stop criterion is met when the 
increase of parameter step size or decrease of the cost function is below $10^{-6}$. 

Fig. 5 Exemplary manual extraction of sensor parameters from an extracted patch of a Cityscapes 
image on a traffic sign in the top right corner. Diagrams clockwise beginning top left: vertical 
chromatic aberration, noise level on black area, horizontal chromatic aberration, noise level on a 
plain white area 

Fig. 6 Optimization of sensor artifacts to decrease divergence between real and synthetic datasets 

Fig. 7 EMD domain divergence calculation of synthetic and optimized synthetic data to real-world 
images. a Comparison of synthetic data with Cityscapes data, b synthetic sensor artifact optimized 
dataset compared to the target dataset Cityscapes 

The overall description of our optimization method is depicted in Fig. 6. Step 1: 
Initial parameters from the optimization method are applied in the sensor artifact 
simulation to the synthetic images. Step 2: The DeeplabV3+ model with ResNet101 
backbone is pre-trained for 15 epochs on the original unmodified synthetic dataset 
and fine-tuned for one epoch on the synthetic dataset with applied sensor artifacts and 
a learning rate of 0.1. Step 3: The model parameters are frozen and set to evaluate. 
Step 4: The model predicts on the validation set of the Cityscapes dataset. Step 5: 
The remaining domain divergence is measured by evaluation of the mIoU per image 
and calculation of the EMD to the evaluations of a model trained on Cityscapes. 
Step 6: The resulting EMD is fed as cost to the optimization method. Step 7: New 
parameters are set for the sensor artifact simulation, or the optimization ends if the 
stop criteria are met. 
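
The loop formed by these steps can be sketched with `scipy.optimize.least_squares` and its `trf` method. Everything specific in the sketch is an assumption made for illustration: `emd_cost` is a hypothetical analytic stand-in for steps 1 to 5 (apply artifacts, fine-tune, predict, compute per-image mIoU and EMD), its optimum is invented, the bounds are made up, the filter size is treated as a continuous variable, and the tolerances of 1e-6 are an assumed reading of the stop criterion.

```python
import numpy as np
from scipy.optimize import least_squares

PARAM_NAMES = ("saturation", "noise", "gamma", "blur", "ca")


def emd_cost(params):
    """Hypothetical stand-in for steps 1-5: returns an EMD-like cost for the parameters.

    A real implementation would apply the sensor simulation with `params`,
    fine-tune the segmentation model, predict on the real validation set and
    return the EMD between the per-image mIoU distributions.
    """
    toy_optimum = np.array([2.1, 3.0, 0.8, 4.0, 0.08])  # invented for this toy surface
    emd = 26.0 + np.sum((np.asarray(params, dtype=np.float64) - toy_optimum) ** 2)
    return np.atleast_1d(emd)  # least_squares expects a residual vector


# Starting point: the parameters extracted from the real-world images.
x0 = np.array([2.0, 3.0, 0.8, 4.0, 0.08])
lower = np.array([0.0, 0.0, 0.1, 1.0, 0.0])   # illustrative bounds only
upper = np.array([10.0, 10.0, 3.0, 9.0, 0.5])

result = least_squares(emd_cost, x0, bounds=(lower, upper), method="trf",
                       xtol=1e-6, ftol=1e-6)
optimized = dict(zip(PARAM_NAMES, result.x))
final_emd = float(result.fun[0])
```

Because the solver minimizes half the squared residual and the EMD is non-negative, minimizing this objective is equivalent to minimizing the EMD itself. With a real training run behind `emd_cost`, each finite-difference Jacobian estimate would require one additional cost evaluation per parameter, so the number of iterations matters.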

After iterating the parameter optimization with the trf method, we compare the 
model trained on the optimized data with the model trained on the unmodified 
synthetic dataset by their per-image mIoU distributions on the Cityscapes dataset. 
Figure 7 depicts the distributions resulting from this evaluation. The DeeplabV3+ 
model trained with the optimized sensor artifact simulation applied on the synthetic 
dataset outperforms the baseline and achieves an EMD score of 26.48, decreasing 
the domain gap by 6.19. The resulting parameters are saturation = 2.11%, 
noise $\sim \mathcal{N}(0, 3.0000005)$, $\gamma = 0.800001$, F = 4, and ca = 0.008000005. The 
parameters changed only slightly from the starting point, indicating that the 
extracted parameters are a good first choice. 

An exemplary visual inspection of the results in Fig.8 helps to understand the 
distribution shift and therefore the decreased EMD. While the best prediction per- 
formance image (top row) increased only slightly from the synthetic trained model 
(c) to the sensor artifact optimized model (d), the worst prediction case (bottom row) 
shows improved segmentation performance for the sensor-artifact-optimized model 
(d), in this case even better than the Cityscapes trained model (b). 
Fig. 8 Top row: Best performance predictions. Bottom row: Worst performance predictions. a 
Original, b Cityscapes-trained model prediction (CS Pred), c synthetically trained model prediction 
(Synth Pred), and d sensor artifact optimized trained model prediction (Synth Opt Pred). While top 
performance increased only slightly, the optimization led to more robust predictions in the worst 
case, i.e., harder examples 


We compare the overall mIoU performance on the Cityscapes dataset between 
models trained with the initial unmodified synthetic dataset, the synthetic dataset with 
randomly initialized lens artifact parameters, and the synthetic dataset with parameters 
extracted from Cityscapes, using a model trained on the Cityscapes dataset as the 
baseline. Results are listed in Table 2 (rows 1–4). Additionally, for the random and 
the extracted parameters, we evaluate the performance with initial and optimized 
parameters, where the parameters have been optimized by our EMD minimization 
(rows 5 and 6). While the model without any sensor simulation achieves the lowest 
overall performance (row 2), the model with random parameter initialization 
achieves a slightly higher performance (row 3) and is surpassed by the model with 
the Cityscapes extracted parameters (row 4). Next, we take the models trained with 
optimized parameters into account (rows 5 and 6). Both models outperform all 
non-optimized experiment settings in terms of overall mIoU, with the model using 
optimized extracted parameters from Cityscapes showing the best overall mIoU (row 6). 
Concretely, the model trained with optimized random starting parameters achieves 
higher performance on the classes road, sidewalk, human, and even significantly on the 
car class but still falls behind on five of the remaining classes and in the overall 
performance on the Cityscapes dataset (row 5). Further, the random parameter optimized 
model took over 22 iterations to converge to its local minimum, whereas the 
optimization of extracted starting parameters only took six iterations until reaching a 
local minimum, making it more than three times faster to converge. Furthermore, 
all models with applied sensor lens artifacts outperform the model 
trained without additional lens artifacts. 


4.3 EMD Cross-evaluation 


To get a deeper understanding of the implications of our EMD score, we evaluate our 
EMD results on a range of real-world and synthetic datasets for semantic segmentation. 
These include the real-world datasets A2D2 [GKM+20], Cityscapes (CS) [COR+16], 
Berkeley Deep Drive (BDD100K) [YCW+20], Mapillary Vistas (MV) [NOBK17], and 
India Driving Dataset (IDD) [VSN+18], as well as the synthetic GTAV [RVRK16], our 
synthetic (Synth and SynthOpt) [KI 20], and Synscapes (SYNS) [WU18] datasets. 
In Table 3 the results of the cross-domain analysis measured with the EMD score are 
depicted. The columns denote that a DeeplabV3+ model has been trained on the 
corresponding dataset, i.e., the source dataset, whereas the rows denote the datasets 
it was evaluated on, i.e., the target datasets. Our optimized synthetic dataset achieves 
lower EMD scores, shown in boldface, than the synthetic baseline. While the domain 
divergence decrease is high on real datasets, the divergence decreased only marginally 
for the other synthetic datasets. Inspecting the EMD results on all datasets, where the lowest 
divergence values are indicated by underline, the MV dataset proves to be closest to 
all the other evaluated datasets. 

Table 3 Cross-domain divergence results of models trained on different real-world and synthetic 
datasets and evaluated on various validation or test sets of an average size of 1000 images. The 
domain divergence is measured with our proposed EMD measure; boldface values indicate the 
lowest divergence comparing our synthetic (Synth) and synthetic-optimized (SynthOpt) datasets, 
whereas underlined values indicate the lowest divergence values over all the datasets. The model 
trained with optimized lens artifacts applied to the synthetic images exhibits a smaller domain 
divergence than the model trained without lens artifacts 

EMD     | A2D2  | BDD100K | CS    | GTAV  | IDD   | MV    | SYNS  | Synth | SynthOpt 
A2D2    | -     | 18.70   | 23.46 | 37.84 | 20.03 | 10.72 | 46.78 | 34.95 | 29.32 
BDD100K | 6.36  | -       | 9.45  | 22.14 | 7.26  | 1.42  | 36.33 | 26.43 | 21.54 
CS      | 10.90 | 12.09   | -     | 36.42 | 13.01 | 4.08  | 20.62 | 32.66 | 26.48 
GTAV    | 33.28 | 28.37   | 29.72 | -     | 30.55 | 23.30 | 37.53 | 36.08 | 32.94 
IDD     | 24.37 | 19.83   | 24.71 | 34.71 | -     | 12.81 | 46.23 | 41.64 | 36.95 
MV      | 10.63 | 10.36   | 14.34 | 28.35 | 9.20  | -     | 35.97 | 30.35 | 27.03 
SYNS    | 25.45 | 31.46   | 23.64 | 45.16 | 25.12 | 23.45 | -     | 43.76 | 43.56 

To set our measure in relation to established domain distance measures, we calculated 
the FID from each of our considered datasets to one another. The results are 
shown in Table 4. The FID, defined in (7), is the Wasserstein-2 distance of feature 
vectors from the Inceptionv3 [SVI+16] network sampled on the two datasets to 
be compared with each other. 

Again, boldface values indicate the lowest FID values between the synthetic 
(Synth) and synthetic-optimized (SynthOpt) datasets, whereas underlined values 
indicate the lowest values of all datasets. Here, only 4 out of the 7 datasets are 
closer, measured by the FID, to the synthetic-optimized dataset than to the origi- 
nal dataset. Furthermore, the FID sees the CS and the SYNS dataset closer to one 
another than the EMD divergence measure, while the MV dataset shows the lowest 
FID among the other evaluated datasets. 

FID and EMD somewhat agree, if we evaluate the distance as minimum per-row 
in both tables, that the Mapillary Vistas dataset is in most cases the dataset that is 
closest to all other datasets. 


Optimized Data Synthesis for DNN Training and Validation ... 143 


Table 4 Cross-domain distance results measured with the Fréchet inception distance (FID). Lowest 
FID between synthetic (Synth) and synthetic optimized (SynthOpt) datasets are in boldface, whereas 
the lowest FID values over all datasets are underlined 

FID     | BDD100K | CS    | GTAV  | IDD   | MV    | SYNS   | Synth  | SynthOpt 
A2D2    | 60.16   | 98.46 | 78.16 | 58.75 | 41.84 | 109.35 | 116.54 | 121.55 
BDD100K | -       | 59.90 | 62.42 | 52.15 | 29.66 | 74.87  | 115.51 | 109.08 
CS      | 59.90   | -     | 85.81 | 68.92 | 59.69 | 43.87  | 119.97 | 112.42 
GTAV    | 62.42   | 85.81 | -     | 74.08 | 51.00 | 89.62  | 92.51  | 92.24 
IDD     | 52.15   | 68.92 | 74.08 | -     | 37.36 | 64.09  | 118.06 | 125.30 
MV      | 29.66   | 59.69 | 51.00 | 37.36 | -     | 70.24  | 74.46  | 78.66 
SYNS    | 74.87   | 43.87 | 89.62 | 64.09 | 70.24 | -      | 113.77 | 108.3 


Now, calculating the minimum per column in both tables, the benefit of our 
asymmetric EMD comes to light. The minimum per-column values of the FID 
are unchanged due to the diagonal symmetry of the cross-evaluation matrix stemming 
from the inherent symmetry of the measure. However, the EMD regards 
BDD100K as the closest dataset. An intuitive explanation for the different minimum 
observations of the EMD is as follows: Training with the many images of the 
Mapillary Vistas dataset, which exhibit different geospatial and sensor properties, covers a very 
broad domain and results in good generalization capability and therefore evaluation 
performance. A model trained on any of the other datasets cannot generalize well to the 
vast domain of Mapillary Vistas, but it can generalize to the rather constrained domain of BDD100K, 
which consists of lower-resolution images with heavy compression artifacts, on which 
even a model that has been trained on BDD100K does not perform well. 
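
The row-wise and column-wise minima discussed here can be recomputed from the EMD values of Table 3. The sketch below transcribes the table into a NumPy array (rows are the datasets a model was evaluated on, columns the datasets it was trained on, diagonal entries set to NaN); the variable names are hypothetical.

```python
import numpy as np

evaluated_on = ["A2D2", "BDD100K", "CS", "GTAV", "IDD", "MV", "SYNS"]  # rows
trained_on = evaluated_on + ["Synth", "SynthOpt"]                      # columns
nan = np.nan

# EMD values transcribed from Table 3.
emd_table = np.array([
    [nan,   18.70, 23.46, 37.84, 20.03, 10.72, 46.78, 34.95, 29.32],
    [6.36,  nan,   9.45,  22.14, 7.26,  1.42,  36.33, 26.43, 21.54],
    [10.90, 12.09, nan,   36.42, 13.01, 4.08,  20.62, 32.66, 26.48],
    [33.28, 28.37, 29.72, nan,   30.55, 23.30, 37.53, 36.08, 32.94],
    [24.37, 19.83, 24.71, 34.71, nan,   12.81, 46.23, 41.64, 36.95],
    [10.63, 10.36, 14.34, 28.35, 9.20,  nan,   35.97, 30.35, 27.03],
    [25.45, 31.46, 23.64, 45.16, 25.12, 23.45, nan,   43.76, 43.56],
])

# Per row: which training dataset comes closest to each evaluation dataset.
closest_source = [trained_on[j] for j in np.nanargmin(emd_table, axis=1)]
# Per column: on which evaluation dataset each trained model has the lowest EMD.
closest_target = [evaluated_on[i] for i in np.nanargmin(emd_table, axis=0)]

for name, source in zip(evaluated_on, closest_source):
    print(f"evaluated on {name}: closest training dataset is {source}")
for name, target in zip(trained_on, closest_target):
    print(f"trained on {name}: lowest EMD when evaluated on {target}")
```

Reading the table in both directions in this way is only meaningful for an asymmetric measure; for a symmetric matrix such as the FID values, the row and column minima coincide wherever both directions are listed.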

The asymmetric nature of our EMD allows for a more thorough and complex 
analysis of dataset discrepancies when applied to tasks of visual understanding, 
e.g., semantic segmentation, which otherwise cannot be captured by inherently 
symmetric distance metrics such as the FID. In contrast to [LLFW20], we could not 
identify with our evaluation method a consistency between the FID and the generalization 
divergence, i.e., our EMD measure. 


5 Conclusions 


In this chapter, we could demonstrate that by utilizing the performance metric per 
image as a proxy distribution for a dataset and the earth mover’s distance (EMD) 
as a divergence measure between distributions, one can decrease visual differences 
of a synthetic dataset through optimization and increase the viability of CGI for 
training and validation purposes of perceptive AI. To reinforce our argument for 
per-image performance measures as proxy distributions, we showed that training an 
ensemble of a fixed model with different random starting conditions but with the same 
hyperparameters leads to the same per-image performance distributions when these 
ensemble models are evaluated on the validation set of the training dataset. When 
utilizing synthetic imagery for validation, the domain gap, due to visual differences 
between real and computer-generated images, is hindering the applicability of these 
datasets. As a step toward decreasing the visual differences, we apply the proposed 
divergence measure as a cost function to an optimization which varies the parameters 
of the sensor artifact simulation, while trying to re-create the sensor artifacts that the 
real-world dataset exhibits. As starting point of the sensor artifact parameters, we 
empirically extracted the values from selected images of the real-world dataset. The 
optimization measurably reduced the visual difference between the real-world and the 
optimized synthetic dataset, as measured by the EMD, and we could show that even when starting 
with randomly initialized parameters we can decrease the EMD and increase the mIoU 
on the target datasets. When measuring the divergence after parameter optimization 
to other real-world and synthetic datasets, we could show that the EMD decreases 
for all considered datasets, but when measured by the FID only four of the datasets 
are closer. As the EMD is derived from the mIoU per image, it is an indicator of 
performance on the target dataset, whereas the FID fails to relate to performance. 
Effective minimization of the visual difference between synthetic and real-world 
datasets with the EMD domain divergence measure is one step further toward fully 
utilizing CGI for validation of perceptive AI functions. 
Abstract While deep neural networks for environment perception tasks in 
autonomous driving systems often achieve impressive performance on clean and 
well-prepared images, their performance under real conditions, i.e., on images 
perturbed with noise patterns or adversarial attacks, often decreases significantly. 
In this chapter, we address this problem for the task of semantic 
segmentation by proposing multi-task training with the additional task of depth 
estimation with the goal to improve the DNN robustness. This method has a very 
wide potential applicability as the additional depth estimation task can be trained in a 
self-supervised fashion, relying only on unlabeled image sequences during training. 
The final trained segmentation DNN is, however, still applicable on a single-image 
basis during inference without additional computational overhead compared to the 
single-task model. Additionally, our evaluation introduces a measure which allows 
for a meaningful comparison between different noise and attack types. We show 
the effectiveness of our approach on the Cityscapes and KITTI datasets, where our 
method improves the DNN performance w.r.t. the single-task baseline in terms of 
robustness against multiple noise and adversarial attack types, which is supplemented 
by an improved absolute prediction performance of the resulting DNN. 


1 Introduction 


Motivation: For a safe operation of highly automated driving systems, a reliable 
perception of the environment is crucial. Various perception tasks such as semantic 
segmentation [LSD15], [CPK+18], depth estimation [EPF14], [ZBSL17], or optical 
flow estimation [LLKX19] are often implemented by deep neural networks (DNNs). 
The output of these DNNs is then used to build a model of the environment, which 
is subsequently used for decision making in high-level planning systems. As many 
decisions are thereby executed upon the DNN predictions, these predictions need to 
be reliable under all kinds of environmental changes. This is, however, in contrast 
to typical DNN training, which is carried out on annotated datasets [COR+16], 
[NOBK17] covering only a small portion of possible environment variations as 
more diverse datasets are often not available. During deployment, the DNN per- 
formance can therefore drop significantly due to domain shifts not covered in the 
training dataset. These include, for example, noises induced by different optical sen- 
sors, brightness and weather changes, deployment in a different country, or even 
directed adversarial attacks [GSS15]. Therefore, for a safe deployment of DNNs in 
autonomous driving vehicles, their performance also needs to be robust w.r.t. these 
environment changes. In this chapter we aim at an improved robustness for the task 
of semantic segmentation. 


Robustness of DNNs: Improving the robustness of DNNs is a highly relevant 
research topic for applications such as autonomous driving, where the perception 
system has to deal with varying environmental conditions and potentially even 
with adversarial attacks. As adversarial attacks usually rely on the computation 
of gradients on the input [GSS15], [MMS+18], it has been proposed to apply a 
non-differentiable pre-processing [GRCvdM18], [LLL+19] such that the impact of 
the perturbation on the DNN performance is reduced. However, the gradients of 
these pre-processings can be approximated [ACW18] so that these strategies usually 
only make an attack harder to calculate. Our approach therefore focuses on a more 
robust training of the DNN where often adversarial examples or image augmenta- 
tions [GSS15], [MMS+18] are utilized already during training such that the DNN 
will be more robust to those during inference. While these approaches often induce a 
decreased performance on clean images, we aim at developing a method improving 
performance as well as robustness. 


Multi-task learning: The training of several tasks in a single multi-task DNN, i.e., 
multi-task learning, is known to improve the absolute prediction performance of the 
single tasks as well as the efficiency of their computation due to the shared network 
parts [EF15], [KTMFs20]. In this chapter, as shown in Fig. 1, we use a multi-task 
network for semantic segmentation and depth estimation, for which these properties 
have also been shown in several works [KGC18], [KTMFs20]. Particularly, we follow 
[KBFs20] in focusing on the additional robustness-increasing properties of such a 
multi-task training. Moreover, it is important to note that the depth estimation can 
be trained in a self-supervised fashion, which only requires short unlabeled video 
sequences and thereby usually does not impose large additional requirements in terms 
of data. While it was still necessary to manually find a good weighting between the 
single tasks in [KBFs20], we apply the GradNorm task weighting strategy [CBLR18], 
which automatically sets and adapts the task weighting according to the task-specific 
training progress. Also, we show the method’s applicability across a wider range of 
datasets. 

Fig. 1 Robustness improvement through multi-task learning during training (left panel: 
single-task trained model with feature extractor and segmentation decoder; right panel: robust 
multi-task trained model). When the input $x_\epsilon$ is subject to perturbations of strength 
$\epsilon$, the output (segmentation $m_\epsilon$, depth $d_\epsilon$) of a multi-task trained 
model (right-hand side) is still well predicted, while the output quality of the single-task trained 
model (left-hand side) is strongly impaired 


Comparability of perturbations: Up to now, a wide variety of different possible 
adversarial attack and noise types has been proposed [GW08], [GSS15], [CW17], 
[MMS+18]. However, each of these image perturbations is characterized by its own 
set of parameters, e.g., the standard deviation of the Gaussian noise distribution 
or the maximum noise value for many adversarial attacks. This makes it hard to 
compare perturbation strengths in a single mutual evaluation. To better compare 
these effects and to be able to draw conclusions between different noise and attack 
types, we employ a measure based on the signal-to-noise ratio, which enables a fair 
comparison between different perturbation types in terms of strength. 


Contributions: To sum up, our contribution with this chapter is threefold. First, we 
propose a multi-task learning strategy with the task of self-supervised monocular 
depth estimation to improve a DNN’s robustness. Second, we provide a detailed 
analysis of the positive effects of this method on DNN performance and robustness 
to multiple input perturbations. Third, we employ a general measure for perturbation 
strength, thereby making different noise and attack perturbation types comparable. 


2 Related Works 


In this section, we give an overview of related multi-task learning approaches for 
depth estimation and semantic segmentation. Afterward, we discuss methods to 
improve the robustness of DNNs focusing on approaches for semantic segmenta- 
tion. 


Multi-task learning The performance of basic perception tasks such as seman- 
tic segmentation [LSD15] and depth estimation [EF15] has increased significantly 
by employing fully convolutional neural networks. Furthermore, these tasks can 
be learned jointly in a multi-task learning setup by employing a shared encoder, 
which was shown to be of mutual benefit for both tasks through the more exten- 
sively learned scene understanding [EF15], [KGC18]. This could be further facil- 
itated by combining the loss functions to enforce cross-task consistency during 
optimization [KGC18], LXOWS18], [ZCX+19]. For depth estimation, the research 
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focus shifted from supervised to self-supervised training techniques [ZBSL17], 
[GMAB17], [GMAFB 19], due to the more general applicability on unlabeled videos, 
which are more readily available than labeled datasets. Accordingly, such tech- 
niques have also been employed in multi-task setups with semantic segmentation 
[CLLW19], [GHL+20], [NVA19]. Usually, the focus of these works is to improve
the absolute prediction performance of the single involved tasks. Thereby, it was
proposed to exclude the influence of dynamic objects such as cars or pedestrians
[KTMFs20], [RKKY+21b], [RKKY+21a], to employ pixel-adaptive convolution for
improved semantic guidance [GHL+20], [RKKY+21b], or to enforce cross-task 
(edge) consistency between both tasks’ outputs [CLLW19], [ZBL20], [YZS+18], 
[MLR+19]. The approaches have also been extended from simple pinhole camera 
models to more general fisheye camera models [RKKY+21b], [RKKY+21a]. For 
semantic segmentation, the depth estimation can also be beneficial in unsupervised 
domain adaptation approaches [KTMFs20], [LRLG19], [VJB+19]. 

In this chapter, we build upon these advances by employing a multi-task learning 
setup of self-supervised depth estimation and semantic segmentation as introduced 
by Klingner et al. [KTMFs20]. However, in contrast to [KTMFs20], we put the 
focus rather on the robustness instead of the absolute prediction performance of the 
resulting semantic segmentation DNN. 


Robustness of (semantic segmentation) DNNs: While DNNs can achieve an impressive
performance on clean images, their performance is usually not robust w.r.t. additive
noise [Bis95], [HK92]. For safety-critical applications, this poses a particularly high
risk if the noise is calculated such that it is nearly imperceptible when added to the
image but still heavily impairs the performance, as shown by the works on adversarial
examples by Szegedy et al. [SZS+14]. Consequently, subsequent works developed
various kinds of targeted and non-targeted perturbations, calculated in an image- 
specific fashion and optimized to fool the DNN. These adversarial examples range 
from simple non-iterative methods such as the fast gradient sign method (FGSM) 
[GSS15] to more complex iteratively calculated methods such as the momentum iter- 
ative fast gradient sign method (MI-FGSM) [DLP+18], the Carlini and Wagner attack 
(CW) [CW17], or the projected gradient descent (PGD) method [MMS+18]. While 
these image-specific attacks may not be a relevant danger in real applications due to 
the high computational effort per image, the existence of image-agnostic adversarial 
perturbations (UAPs) has also been shown, e.g., by prior-driven uncertainty esti- 
mation (PD-UA) [LJL+19], universal adversarial perturbation (UAP) [MDFFF17], 
or fast feature fool (FFF) [MGB17]. Although most of these works focus on the 
rather simple task of image classification, the applicability of these attacks to seman- 
tic segmentation is well known [AMT18], [MGR19], [BLK+21]. Furthermore, the 
attacks can be designed in a fashion such that the semantic segmentation outputs are 
completely wrong but still appear realistic [ASG+19], [MCKBF17]. 

The vulnerability of DNNs to different kinds of adversarial attacks has encouraged
research in defense methods, which improve the robustness against these perturbations.
They can be roughly divided into four categories. First, for gradient
masking, the idea is to prevent a potential attacker from calculating the gradients
[CBG+17], [MMS+18], [PMG+17]. However, Athalye et al. [ACW18] showed that
the perturbations can be calculated from a different model. Also, they can be calcu- 
lated in an image-agnostic (universal) fashion [MCKBF17], [MDFFF17], see also 
the Chapter “Improving Transferability of Generated Universal Adversarial Per- 
turbations for Image Classification and Segmentation” [HAMFs22], such that gra- 
dient masking methods cannot prevent the network’s vulnerability to adversarial 
attacks under all conditions. Second, through input transformations, the perturba- 
tions can be removed from the input image [GRCvdM18], [KBV+21], [LLL+19], 
e.g., by JPEG image compression [ATT18] or by incorporation of randomness 
[RSFM19], [XWZ+18]. While this improves the performance on attacked images, 
these transformations usually impair the performance on clean images, introducing 
an unfavorable trade-off. Third, redundancy among two or three networks serving 
the same task can be exploited, if networks are enforced to reveal some independence
[BHSFs19], [BKV+20]. Fourth, robust training methods can be employed
such that the DNN is less sensitive to attacks from the start. Common
approaches include, e.g., adversarial training [GSS15], [MMS+18], robustness-oriented
loss functions [BHSFs19], [BKV+20], [CLC+19], or self-supervision from
auxiliary tasks [CLC+20], [HMKS19], [ZYC+20] during (pre-)training. Our chapter 
also focuses on robustness through self-supervision, where we introduce multi-task 
learning with the auxiliary task of depth estimation as a method to improve a seman- 
tic segmentation DNN’s robustness, which can be seen as an extension of [KBFs20]. 
Thereby, we achieve an improved performance, while also being more robust and 
even introducing a second useful task for scene perception. Compared to [KBFs20], 
we improve the multi-task learning strategy to reduce the need for manual task 
weighting and provide experiments across a wider range of datasets. 


3 Multi-task Training Approach 


Describing our multi-task learning approach, we start by defining our primary seman- 
tic segmentation task, followed by the auxiliary task of depth estimation. Finally, we 
describe how both tasks are trained in a multi-task learning setup as shown in Fig. 2. 


Primary semantic segmentation: Semantic segmentation is defined as the pixel-wise
classification of an input image $x_t = (x_{t,i}) \in \mathbb{G}^{H \times W \times C}$ at time $t$, with height
$H$, width $W$, number of channels $C = 3$, and $\mathbb{G} = \{0, 1, \ldots, 255\}$ (cf. top branch of
Fig. 2). Accordingly, for each pixel $x_{t,i} \in \mathbb{G}^{3}$ with pixel index $i \in \mathcal{I} = \{1, \ldots, H \cdot W\}$
an output $y_{t,i} = (y_{t,i,s}) \in \mathbb{I}^{|\mathcal{S}|}$, $\mathbb{I} = [0, 1]$, is predicted, which can be interpreted as
the posterior probabilities that the pixel at index $i$ belongs to a class $s \in \mathcal{S}$ from
the set of classes $\mathcal{S} = \{1, 2, \ldots, |\mathcal{S}|\}$, with the number of classes $|\mathcal{S}|$. The final
predicted semantic class $m_{t,i} \in \mathcal{S}$ is determined by applying the argmax operation
$m_{t,i} = \mathrm{argmax}_{s \in \mathcal{S}} \, y_{t,i,s}$. Thereby, the network output $y_t \in \mathbb{I}^{H \times W \times |\mathcal{S}|}$ is converted to a
segmentation mask $m_t = (m_{t,i}) \in \mathcal{S}^{H \times W}$.
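The argmax step can be illustrated with a minimal sketch. The function name `segmentation_mask` and the flat list-of-pixels layout are assumptions made for this illustration only; they are not taken from the chapter's implementation.

```python
def segmentation_mask(y):
    """Convert class posteriors y[i][s] (one row per pixel) to class indices.

    Classes are numbered 1..|S| as in the text, so the 0-based list
    position of the largest posterior is shifted by one.
    """
    mask = []
    for probs in y:
        best = max(range(len(probs)), key=lambda s: probs[s])
        mask.append(best + 1)
    return mask


# Two pixels, three classes; each row sums to one like a softmax output.
y_t = [[0.1, 0.7, 0.2],
       [0.6, 0.3, 0.1]]
print(segmentation_mask(y_t))  # [2, 1]
```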

Fig. 2 Multi-task training setup across domains for the joint learning of depth estimation and
semantic segmentation. While the semantic segmentation is trained on labeled image pairs, the 
depth estimation is trained on unlabeled image sequences 


As shown in Fig. 2, training of a semantic segmentation network requires ground
truth labels $\bar{y}_t \in \{0, 1\}^{H \times W \times |\mathcal{S}|}$, which are derived from the ground truth
segmentation mask $\bar{m}_t \in \mathcal{S}^{H \times W}$ represented by a one-hot encoding. These are utilized to train
the network by a cross-entropy loss function as

$$J_t^{\mathrm{seg}} = -\frac{1}{H \cdot W} \sum_{i \in \mathcal{I}} \sum_{s \in \mathcal{S}} w_s \, \bar{y}_{t,i,s} \cdot \log\left(y_{t,i,s}\right), \tag{1}$$

where $w_s$ are class-wise weights obtained as outlined in [PCKC16], and $\bar{y}_{t,i,s} \in \{0, 1\}$
are the single elements from the one-hot encoded ground truth tensor $\bar{y}_t = (\bar{y}_{t,i,s})$.
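As a minimal sketch of the class-weighted cross-entropy described above, the following function averages the weighted log-posteriors of the true classes over all pixels. The pixel-averaging convention, the function name, and the small constant `eps` (added only to avoid `log(0)`) are assumptions of this illustration.

```python
import math


def weighted_cross_entropy(y, y_bar, w, eps=1e-12):
    """Class-weighted cross-entropy averaged over pixels.

    y:     predicted posteriors, y[i][s] in [0, 1]
    y_bar: one-hot ground truth, y_bar[i][s] in {0, 1}
    w:     class-wise weights w[s]
    eps:   hypothetical stabilizer against log(0), not part of the text
    """
    total = 0.0
    for probs, onehot in zip(y, y_bar):
        for s, (p, t) in enumerate(zip(probs, onehot)):
            total += w[s] * t * math.log(p + eps)
    return -total / len(y)


y = [[0.5, 0.5], [0.25, 0.75]]
y_bar = [[1, 0], [0, 1]]
print(weighted_cross_entropy(y, y_bar, w=[1.0, 2.0]))
```

Only the posterior of the labeled class contributes per pixel, so a rare class with a large weight influences the loss more strongly than a frequent one.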


Auxiliary depth estimation: Aiming at a more robust feature representation, we
employ the auxiliary depth estimation task, which can be trained on unlabeled image
sequences, as shown in Fig. 2, bottom part. During training, the depth estimation
DNN predicts a depth map $d_t = (d_{t,i}) \in \mathbb{D}^{H \times W}$, representing the distance of each
pixel from the camera plane, where $\mathbb{D} = [d_{\min}, d_{\max}]$ represents the space of considered
depth values constrained between the lower bound $d_{\min}$ and the upper bound
$d_{\max}$. We optimize the depth estimation DNN in a self-supervised fashion by considering
two loss terms: First, we make use of the image reconstruction term $J_t^{\mathrm{ph}}$
(photometric loss), which essentially uses the depth estimation in conjunction with
the relative pose between two images, to optimize the camera reprojection models
between two consecutive frames of a video. Second, we apply a smoothness loss term
$J_t^{\mathrm{sm}}$, allowing high gradients in the depth estimate's values only in image regions
with high color gradients, such that the total loss can be written as

$$J_t^{\mathrm{depth}} = J_t^{\mathrm{ph}} + \beta J_t^{\mathrm{sm}}, \tag{2}$$

where $\beta = 10^{-3}$ is adopted from previous works [CPMA19], [GMAFB19],
[KTMFs20].

To optimize the network, we rely solely on sequential pairs of images $x_t$, $x_{t'}$ with
$t' \in \mathcal{T}' = \{t-1, t+1\}$, which are taken from a video. These pairs are passed to an
additional pose network, which predicts the relative poses $T_{t \to t'} \in SE(3)$ between
the image pairs, where $SE(3)$ is the special Euclidean group representing the set of all
possible rotations and translations [Sze10]. The network predicts this transformation
in an axis-angle representation such that only the six degrees of freedom are predicted,
which are canonically converted to a $4 \times 4$ matrix for further processing. By letting
the depth network predict the depth $d_t$, both outputs, i.e., the depth $d_t$ and the poses
$T_{t \to t'}$, can be used to project the image frame $x_{t'}$ at time $t'$ onto the pixel coordinates
of the image frame $x_t$, which results in two projected images $x_{t' \to t}$, $t' \in \mathcal{T}'$ (for a
detailed description, we refer to [KTMFs20]). Consequently, the reprojection model
can be optimized by minimizing the pixel-wise distance between the projected images
$x_{t' \to t} = (x_{t' \to t,i})$ and the actual images $x_t = (x_{t,i})$ as

$$J_t^{\mathrm{ph}} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \min_{t' \in \mathcal{T}'} \left( \frac{\gamma}{2} \left(1 - \mathrm{SSIM}_i\left(x_t, x_{t' \to t}\right)\right) + (1 - \gamma) \frac{1}{C} \left\| x_{t,i} - x_{t' \to t,i} \right\|_1 \right). \tag{3}$$

This so-called minimum reprojection loss or photometric loss [GMAFB19] utilizes
a mixture of the structural similarity (SSIM) difference term $\mathrm{SSIM}_i(\cdot) \in \mathbb{I}$, with
$\mathbb{I} = [0, 1]$, which is computed on $3 \times 3$ patches of the input, and an $L_1$ difference term
$\|\cdot\|_1$ computed over all $C = 3$ gray value channels. For optimal absolute prediction
performance, the terms are weighted by a factor $\gamma = 0.85$, chosen as in previous
approaches [CPMA19], [GMAFB19], [KTMFs20], [YS18]. The depth and pose
networks are then implicitly optimized, as their outputs are the parameters of the
projection model, used to obtain the projected image $x_{t' \to t}$, whose distance to the
actual image $x_t$ is minimized by (3).
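The per-pixel minimum over the projected frames can be sketched as follows. This is an illustration under stated assumptions: the SSIM values are taken as precomputed inputs (`ssim_maps` is a hypothetical argument, since computing SSIM on 3 x 3 patches is outside the scope of the sketch), images are flat lists of pixels, and the weighting follows the mixture with $\gamma = 0.85$ described in the text.

```python
def photometric_loss(x_t, warped, ssim_maps, gamma=0.85):
    """Minimum reprojection loss over several projected frames.

    x_t:       target image as a list of pixels, each a list of C values
    warped:    list of projected images, same layout as x_t
    ssim_maps: ssim_maps[k][i] in [0, 1], SSIM of pixel i for projected
               image k (assumed to be precomputed elsewhere)
    """
    total = 0.0
    for i, pixel in enumerate(x_t):
        candidates = []
        for k, image in enumerate(warped):
            # Channel-averaged L1 distance between actual and projected pixel.
            l1 = sum(abs(a - b) for a, b in zip(pixel, image[i])) / len(pixel)
            ssim_term = 0.5 * gamma * (1.0 - ssim_maps[k][i])
            candidates.append(ssim_term + (1.0 - gamma) * l1)
        # Keep only the best-matching projected frame for this pixel.
        total += min(candidates)
    return total / len(x_t)


x_t = [[0.0, 0.0, 0.0]]
warped = [[[1.0, 1.0, 1.0]], [[0.0, 0.0, 0.0]]]
print(photometric_loss(x_t, warped, ssim_maps=[[0.0], [1.0]]))
```

Taking the minimum per pixel means that a pixel occluded in one neighboring frame can still be explained by the other frame, which is the usual motivation for this form of the loss.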

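As the photometric loss $J_t^{\mathrm{ph}}$ does not enforce a relationship in the depth map
between depth values of neighboring pixels, we use a second smoothness loss term
$J_t^{\mathrm{sm}}$ [GMAB17], allowing non-smoothness in the depth map $d_t$ only in image locations
with strong color gradients. This loss is computed on the mean-normalized
inverse depth $\bar{\rho}_t \in \mathbb{R}^{H \times W}$, whose elements can be obtained from the depth map as
$\bar{\rho}_{t,i} = \rho_{t,i} \cdot \left(\frac{1}{HW} \sum_{j \in \mathcal{I}} \rho_{t,j}\right)^{-1}$ with $\rho_{t,i} = \frac{1}{d_{t,i}}$. Thereby, the loss can be formulated as

$$J_t^{\mathrm{sm}} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \left( \left|\partial_h \bar{\rho}_{t,i}\right| \exp\left(-\frac{1}{C} \left\|\partial_h x_{t,i}\right\|_1\right) + \left|\partial_w \bar{\rho}_{t,i}\right| \exp\left(-\frac{1}{C} \left\|\partial_w x_{t,i}\right\|_1\right) \right), \tag{4}$$

with one-dimensional difference quotients $\partial_h \bar{\rho}_{t,i} = \partial_h \bar{\rho}_{t,(h,w)} = \bar{\rho}_{t,(h,w)} - \bar{\rho}_{t,(h-1,w)}$
and $\partial_w \bar{\rho}_{t,i} = \partial_w \bar{\rho}_{t,(h,w)} = \bar{\rho}_{t,(h,w)} - \bar{\rho}_{t,(h,w-1)}$ defined with respect to the height and
width dimensions of the image, respectively. The indices $h$ and $w$ represent the exact
pixel position in two dimensions, where $h \in \{2, \ldots, H\}$ and $w \in \{2, \ldots, W\}$, where
we exclude $h = 1$ and $w = 1$ to ensure the existence of a preceding pixel in height
and width dimensions.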
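The edge-aware smoothness term can be sketched on small nested lists. The normalization by the full pixel count and the function name are assumptions of this illustration; the first row and column are skipped, as stated in the text, so that a preceding pixel always exists.

```python
import math


def smoothness_loss(depth, image):
    """Edge-aware smoothness on the mean-normalized inverse depth.

    depth: depth[h][w] > 0
    image: image[h][w] is a list of C channel values
    """
    H, W = len(depth), len(depth[0])
    rho = [[1.0 / d for d in row] for row in depth]
    mean = sum(sum(row) for row in rho) / (H * W)
    rho = [[v / mean for v in row] for row in rho]

    def color_grad(p, q):
        # Channel-averaged L1 difference between two neighboring pixels.
        return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

    total = 0.0
    for h in range(1, H):        # skip the first row (no preceding pixel)
        for w in range(1, W):    # skip the first column
            dh = abs(rho[h][w] - rho[h - 1][w])
            dw = abs(rho[h][w] - rho[h][w - 1])
            total += dh * math.exp(-color_grad(image[h][w], image[h - 1][w]))
            total += dw * math.exp(-color_grad(image[h][w], image[h][w - 1]))
    return total / (H * W)


flat_image = [[[0.0] * 3, [0.0] * 3], [[0.0] * 3, [0.0] * 3]]
print(smoothness_loss([[1.0, 1.0], [1.0, 2.0]], flat_image))
```

A depth discontinuity is penalized fully where the image is uniform, while the exponential factor suppresses the penalty where the image itself has a strong color edge.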


Multi-task learning: To train our model directly in a one-stage fashion without
a pre-trained semantic segmentation network, we choose a multi-task setup with a
shared encoder and two task-specific decoder heads as shown in Fig. 2. The decoders
for semantic segmentation and depth estimation are optimized for their respective
tasks according to the losses defined in (1) and (2), respectively, while the encoder is
optimized for both tasks. As in [GL15], [KTMFs20], we let the task-specific gradients
$g_t^{\mathrm{seg}}$ and $g_t^{\mathrm{depth}}$ propagate unscaled in the respective decoders, while their influence
when reaching the encoder layers during backpropagation is scaled as

$$g_t^{\mathrm{total}} = \lambda^{\mathrm{depth}} g_t^{\mathrm{depth}} + \lambda^{\mathrm{seg}} g_t^{\mathrm{seg}}, \tag{5}$$

where the scalar weights $\lambda^{\mathrm{seg}}$ and $\lambda^{\mathrm{depth}}$ determine the weighting between the two
tasks. In the other encoder layers, the backpropagation is then executed as usual.
By scaling the gradients instead of the losses, the two decoders can learn optimal
task-specific weights, while their influence on the shared encoder can be scaled to
the optimal ratio using (5). Thereby, the encoder does not only learn optimal features
for the semantic segmentation task, but can also benefit from the additional data
accessible by the depth estimation, which can be trained on arbitrary videos.
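The weighted combination in (5) amounts to simple arithmetic on the two gradients arriving at the encoder. The sketch below shows only this arithmetic on plain lists; the function name is hypothetical, and in an automatic differentiation framework the scaling would typically be applied inside the backward pass at the encoder-decoder connections.

```python
def encoder_gradient(g_seg, g_depth, lam_seg, lam_depth):
    """Combine the task-specific gradients reaching the shared encoder.

    g_seg, g_depth: gradients w.r.t. the same encoder activations,
                    given here as flat lists of equal length
    lam_seg, lam_depth: scalar task weights
    """
    return [lam_depth * gd + lam_seg * gs for gs, gd in zip(g_seg, g_depth)]


# The decoders see their own gradients unscaled; only the encoder
# receives the weighted sum.
print(encoder_gradient([1.0, -2.0], [0.5, 4.0], lam_seg=0.9, lam_depth=0.1))
```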

In this chapter, we compare two kinds of task weightings: First, we apply the
gradient weighting (GW) from [KBFs20], where we set $\lambda^{\mathrm{seg}} = \lambda$ and $\lambda^{\mathrm{depth}} = 1 - \lambda$
to scale the influence of both tasks. Here, the hyperparameter $\lambda$ needs to be tuned to
find the optimal weighting of both tasks. However, the results from [KBFs20] show
that for a badly chosen hyperparameter $\lambda$, performance decreases drastically, which is
why we apply the GradNorm (GN) multi-task learning technique [CBLR18], where
the scale factors $\lambda^{\mathrm{seg}}$ and $\lambda^{\mathrm{depth}}$ are reformulated as learnable parameters $\lambda^{\mathrm{seg}}(\tau)$ and
$\lambda^{\mathrm{depth}}(\tau)$, which are adapted at each learning step $\tau$. For simplicity, we henceforth
abbreviate the tasks with an index $k$ with $k \in \mathcal{K} = \{\mathrm{depth}, \mathrm{seg}\}$. Thereby, the loss
function used for optimization of the scale factors, i.e., the task weights, is calculated
as follows:

$$J^{\mathrm{GN}}(\tau) = \sum_{k \in \mathcal{K}} \left| G^{(k)}(\tau) - \left( \frac{1}{|\mathcal{K}|} \sum_{\kappa \in \mathcal{K}} G^{(\kappa)}(\tau) \right) \cdot \left( r^{(k)}(\tau) \right)^{\alpha} \right|. \tag{6}$$

It depends on the magnitude $G^{(k)}(\tau) = \left\| \lambda^{(k)}(\tau) g^{(k)}(\tau) \right\|_2$ of the task-specific gradients
$g^{(k)}(\tau)$ in the last shared layer and the task-specific training rates
$r^{(k)}(\tau) = \tilde{J}^{(k)}(\tau) \cdot \left( \frac{1}{|\mathcal{K}|} \sum_{\kappa \in \mathcal{K}} \tilde{J}^{(\kappa)}(\tau) \right)^{-1}$ with $\tilde{J}^{(k)}(\tau) = J^{(k)}(\tau) \cdot \left( J^{(k)}(0) \right)^{-1}$, depending on
the value of the task-specific losses $J^{(k)}(\tau)$ taken from (1) and (2) at each step $\tau$ in
comparison to their values $J^{(k)}(0)$ at step $\tau = 0$. These training rates can be interpreted
as the convergence progress of the single tasks. Note that through the loss
for the scaling factors in (6) a similar and stable convergence speed of both tasks
is encouraged, which is essential for a successful multi-task training. The network
is optimized with alternating updates of the scaling factors $\lambda^{(k)}(\tau)$ by (6) and the
network weights by (1) and (2). Although a more stable convergence is generally
expected, one can still optimize GradNorm (GN) with the hyperparameter $\alpha$ in (6),
controlling the restoring force back to balanced training rates. Also, after each step,
the task weights are renormalized such that $\lambda^{\mathrm{seg}}(\tau) + \lambda^{\mathrm{depth}}(\tau) = 1$.
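The quantities entering the GradNorm objective can be sketched with plain dictionaries keyed by task. The helper names are hypothetical, the gradient magnitudes are assumed to be given (in practice they come from backpropagation), and the optimizer step that actually updates the task weights is omitted.

```python
def training_rates(losses, initial_losses):
    """Relative training rates: loss ratios normalized by their mean."""
    ratios = {k: losses[k] / initial_losses[k] for k in losses}
    mean_ratio = sum(ratios.values()) / len(ratios)
    return {k: ratios[k] / mean_ratio for k in ratios}


def gradnorm_loss(grad_norms, rates, alpha):
    """Sum of absolute deviations of each weighted gradient magnitude
    from the mean magnitude scaled by the task's training rate."""
    mean_norm = sum(grad_norms.values()) / len(grad_norms)
    return sum(abs(grad_norms[k] - mean_norm * rates[k] ** alpha)
               for k in grad_norms)


def renormalize(weights):
    """Rescale the task weights so that they sum to one."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


rates = training_rates({"seg": 0.5, "depth": 1.0}, {"seg": 1.0, "depth": 1.0})
print(rates)
print(gradnorm_loss({"seg": 2.0, "depth": 2.0}, rates, alpha=1.0))
print(renormalize({"seg": 0.3, "depth": 0.9}))
```

In this toy setting the segmentation loss has already halved while the depth loss has not moved, so the depth task receives the larger training rate and the objective favors a larger gradient magnitude for it.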


4 Evaluation Setup 


In this section, we present the input perturbations used to evaluate the model robust- 
ness, define the perturbation strength, and detail our chosen evaluation metrics. 
Finally, we provide an overview of our chosen databases and the implementation 
details of our method. The evaluation setup is depicted in Fig. 3. 


Image perturbations: After the semantic segmentation network has been trained
(either by single-task or multi-task training), we evaluate its robustness during inference
by adding a perturbation $r_\epsilon$ of strength $\epsilon$ to each input image $x$, yielding a
perturbed input image $x_\epsilon = x + r_\epsilon$. For simplicity, we omit the subscript $t$ as the
evaluation is carried out on single images. We conduct experiments using two noise
types and two adversarial attacks, for which we aim at imposing a perturbation
strength $\epsilon$ to make different perturbation types comparable. We measure this comparable
strength by the signal-to-noise ratio (SNR) on the input image $x_\epsilon$, which
is defined as

$$\mathrm{SNR} = 10 \log_{10} \left( \frac{E\left(\|x\|_2^2\right)}{E\left(\|r_\epsilon\|_2^2\right)} \right). \tag{7}$$

Here, $E\left(\|x\|_2^2\right)$ is the expectation value of the sum of the image's squared gray values,
and $E\left(\|r_\epsilon\|_2^2\right)$ is the expectation value of the sum of the squared noise pixels. As
$E\left(\|x\|_2^2\right)$ is always equal for different perturbation types, we define the perturbation's
strength in dependency of $E\left(\|r_\epsilon\|_2^2\right)$ as

$$\epsilon = \sqrt{\frac{1}{H W C} E\left(\|r_\epsilon\|_2^2\right)}. \tag{8}$$
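A minimal sketch of the strength measure follows. It assumes the usual decibel form of the SNR and replaces the expectation by a single-sample estimate over one flattened image; both choices, as well as the function names, are assumptions of this illustration.

```python
import math


def perturbation_strength(r):
    """Root of the mean squared perturbation value.

    r: perturbation as a flat list of H * W * C values (single-image
       estimate of the expectation).
    """
    return math.sqrt(sum(v * v for v in r) / len(r))


def snr_db(x, r):
    """Signal-to-noise ratio in decibels for image x and perturbation r."""
    signal = sum(v * v for v in x)
    noise = sum(v * v for v in r)
    return 10.0 * math.log10(signal / noise)


r = [2.0, -2.0, 2.0, -2.0]
print(perturbation_strength(r))
print(snr_db([10.0, 10.0, 10.0, 10.0], r))
```

Because the signal energy of a fixed evaluation set does not depend on the perturbation type, fixing the strength fixes the SNR, which is what makes noise and attacks comparable.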


Fig. 3 Evaluation setup using additive perturbations to evaluate the robustness of a segmentation
DNN. As perturbations, we use various noise and attack types to simulate deployment conditions

We consider Gaussian noise, salt and pepper noise, the fast gradient sign method
(FGSM) attack [GSS15], and the projected gradient descent (PGD) attack [KGB17].
For Gaussian noise, the perturbation strength can be identified as the standard deviation
of the underlying Gaussian distribution. For salt and pepper noise, on the other
hand, some pixels are randomly set to 0 or 255. Therefore, first the input is perturbed,
then the perturbation is obtained by $r_\epsilon = x_\epsilon - x$, and finally the perturbation strength
is computed by (8).
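The three-step procedure for salt and pepper noise (perturb, take the difference, measure the strength) can be sketched as follows. The parameter `fraction` (the share of values that are replaced) and the flat list representation are hypothetical choices for this illustration; the text does not state how the affected pixels are selected.

```python
import math
import random


def salt_and_pepper(x, fraction, rng):
    """Set a random share of the values in x to 0 or 255."""
    x_eps = list(x)
    for j in range(len(x_eps)):
        if rng.random() < fraction:
            x_eps[j] = rng.choice((0, 255))
    return x_eps


def perturbation_strength(r):
    """Root of the mean squared perturbation value."""
    return math.sqrt(sum(v * v for v in r) / len(r))


rng = random.Random(0)                         # fixed seed for repeatability
x = [128] * 1000                               # flattened toy image
x_eps = salt_and_pepper(x, 0.05, rng)          # step 1: perturb the input
r_eps = [a - b for a, b in zip(x_eps, x)]      # step 2: recover the perturbation
print(perturbation_strength(r_eps))            # step 3: measure its strength
```

To reach a prescribed strength, one would adjust `fraction` until the measured value matches the target.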

For the FGSM adversarial attack, the perturbation is calculated according to

$$x_\epsilon = x + \epsilon \cdot \mathrm{sign}\left( \nabla_x J^{\mathrm{seg}}\left(\bar{y}, y(x)\right) \right), \tag{9}$$

where $\nabla_x$ represents the derivative of the loss function with respect to the unperturbed
image $x$, and $\mathrm{sign}(\cdot)$ represents the signum function applied to each element of
its vectorial argument. As all elements $r_{\epsilon,j}$ of the perturbation $r_\epsilon$ can only take
on values $r_{\epsilon,j} = \pm\epsilon$, $j \in \{1, \ldots, H \cdot W \cdot C\}$, the perturbation strength obtained
when applying (8) is equal to $\epsilon$. We also consider the PGD adversarial attack, due to its more
advanced attack design, which is optimized over several optimization steps. This
attack can be interpreted as an iterative version of (9) and allows investigations of
the network robustness w.r.t. stronger adversarial attacks.
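The FGSM step and the resulting perturbation strength can be checked with a small sketch. The gradient is assumed to be given (in practice it is obtained by backpropagation through the network), and the helper names are hypothetical.

```python
import math


def sign(v):
    """Signum function: -1, 0, or 1."""
    return (v > 0) - (v < 0)


def fgsm(x, grad, eps):
    """Single-step attack: move every value by eps along the gradient sign."""
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


x = [10.0, 20.0, 30.0]
grad = [0.3, -1.2, 5.0]            # assumed loss gradient w.r.t. the input
x_eps = fgsm(x, grad, eps=2.0)
r_eps = [a - b for a, b in zip(x_eps, x)]
strength = math.sqrt(sum(v * v for v in r_eps) / len(r_eps))
print(x_eps, strength)
```

Since every element of the perturbation is either `+eps` or `-eps` (provided no gradient entry is exactly zero), the root mean square of the perturbation equals `eps`.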


Evaluation metrics: To evaluate DNN performance on a perturbed input image $x_\epsilon$,
we pass this image to the DNN and generate the output $m_\epsilon$, see Fig. 3. The output
quality for a perturbation with $\epsilon > 0$ is typically worse than the output generated from
clean images $x_{\epsilon=0}$. The absolute segmentation performance can then be obtained by
calculating the mean intersection-over-union metric [EVGW+15] as

$$\mathrm{mIoU}_\epsilon = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \frac{\mathrm{TP}_{\epsilon,s}}{\mathrm{TP}_{\epsilon,s} + \mathrm{FP}_{\epsilon,s} + \mathrm{FN}_{\epsilon,s}}, \tag{10}$$

where the number of true positives $\mathrm{TP}_{\epsilon,s}$, false positives $\mathrm{FP}_{\epsilon,s}$, and false negatives
$\mathrm{FN}_{\epsilon,s}$ for each class $s$ are calculated over the entire evaluation subset before the
application of (10).

As different models may have a different absolute prediction performance
$\mathrm{mIoU}_{\epsilon=0}$ on clean images, a better performance on perturbed images can either mean
that the model is better in general or more robust. Therefore, we rather compare the
mIoU ratio

$$Q = \frac{\mathrm{mIoU}_\epsilon}{\mathrm{mIoU}_{\epsilon=0}} \tag{11}$$

obtained from the performance $\mathrm{mIoU}_\epsilon$ on perturbed images with a perturbation of
strength $\epsilon$, normalized by the performance on clean images.
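Both evaluation metrics can be sketched from accumulated per-class counts. The function names are hypothetical, and the sketch assumes that every class occurs at least once in the evaluation subset so that no denominator is zero.

```python
def mean_iou(tp, fp, fn):
    """Mean intersection-over-union from per-class counts that were
    accumulated over the whole evaluation subset."""
    ious = [t / (t + p + n) for t, p, n in zip(tp, fp, fn)]
    return sum(ious) / len(ious)


def miou_ratio(miou_perturbed, miou_clean):
    """Ratio Q: performance under perturbation relative to clean input."""
    return miou_perturbed / miou_clean


clean = mean_iou(tp=[8, 5], fp=[1, 5], fn=[1, 0])
perturbed = mean_iou(tp=[4, 2], fp=[3, 6], fn=[3, 2])
print(clean, perturbed, miou_ratio(perturbed, clean))
```

Normalizing by the clean performance separates robustness from absolute accuracy: two models with different clean mIoU values obtain the same Q if they lose the same relative share of performance.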


Databases: The additional training with the auxiliary task of depth estimation in a
self-supervised fashion is applied jointly with the training of the primary semantic
segmentation task. To accomplish this, for training, we always rely on an unlabeled
image sequence dataset (bottom part of Table 1(a)) to train the depth estimation and
one labeled image dataset (top part of Table 1(a)) to train the semantic segmentation.
Both datasets can be from different domains, as [KTMFs20] showed that a
successful training for both tasks is also possible then. For segmentation training,
we use Cityscapes [COR+16] ($\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$), while for depth estimation training we either
use Cityscapes [COR+16] ($\mathcal{X}^{\mathrm{CS,seq}}_{\mathrm{train}}$) or KITTI [GLSU13] ($\mathcal{X}^{\mathrm{KIT}}_{\mathrm{train}}$), with the KITTI
training split defined by [GMAB17]. Note that each image of the Cityscapes segmentation
dataset contains 19 preceding and 10 succeeding unlabeled image frames,
which we use for the depth estimation training. The number of training images for
the depth estimation training splits deviates slightly from the original definitions
due to the need of a preceding and a succeeding frame. For evaluation, we use the
validation set from Cityscapes ($\mathcal{X}^{\mathrm{CS}}_{\mathrm{val}}$), as the test sets are not publicly available. We
also evaluate on the training set from the KITTI 2015 Stereo dataset [MG15] ($\mathcal{X}^{\mathrm{KIT}}_{\mathrm{val}}$),
which is disjoint from the training split as outlined in [GMAB17].


Table 1 (a) Overview on the employed datasets with their respective subsets; (b) the employed
image perturbations used to evaluate the robustness of the DNN

(a) Databases; number of images

Dataset                       Symbol                              Training   Validation   Test
Cityscapes [COR+16]           $\mathcal{X}^{\mathrm{CS}}$         2,975      500          1,525
KITTI [GLSU13, MG15]          $\mathcal{X}^{\mathrm{KIT}}$        28,937     200          200
Cityscapes (seq.) [COR+16]    $\mathcal{X}^{\mathrm{CS,seq}}$     83,300     -            -

(b) Perturbations

Perturbation             Type
Gaussian noise           Random noise
Salt and pepper noise    Random noise
FGSM [GSS15]             Adversarial attack
PGD [KGB17]              Adversarial attack


Implementation details: All models as well as training and evaluation protocols
are implemented in PyTorch [PGM+19] and executed on an NVIDIA Tesla
P100 graphics card. As in [KTMFs20], we choose an encoder-decoder multi-task
network architecture based on a ResNet-18 feature extractor [HZRS16] pre-trained
on ImageNet [RDS+15] and two task-specific decoder heads. These heads have
an identical architecture except for the last layer: For depth estimation we employ
a sigmoid output $\sigma_t \in \mathbb{I}^{H \times W}$, which is converted to depth values in a pixel-wise
fashion as $\frac{1}{a \sigma_{t,i} + b}$, where $a$ and $b$ constrain the depth to the considered range [0.1, 100].
The segmentation output logits comprise $S = |\mathcal{S}|$ feature maps, which are
converted to class probabilities via a softmax function. The pose network, which is
required to train the depth estimation in a self-supervised fashion on videos, utilizes
the same network architecture as in [GMAFB19].
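One plausible way to choose the two constants is sketched below, assuming the inverse-linear mapping from the sigmoid output to depth that is common in self-supervised depth estimation. The function name and the derivation of `a` and `b` from the range limits are assumptions of this illustration, not values stated in the chapter.

```python
def sigmoid_to_depth(sigma, d_min=0.1, d_max=100.0):
    """Map a sigmoid output in [0, 1] to a depth in [d_min, d_max].

    The constants follow from the two boundary conditions:
    sigma = 0 gives depth d_max, sigma = 1 gives depth d_min.
    """
    b = 1.0 / d_max
    a = 1.0 / d_min - b
    return 1.0 / (a * sigma + b)


for sigma in (0.0, 0.5, 1.0):
    print(sigma, sigmoid_to_depth(sigma))
```

Under this reading, the network effectively predicts a scaled inverse depth, so that nearby structures occupy most of the sigmoid's output range.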

We resize all images used for depth estimation training to a resolution of 640 × 192,
while the images used for segmentation training are resized to 512 × 1024 and
cropped to the same resolution. We train the network for 40 epochs using the Adam
optimizer [KB15] with a learning rate of $10^{-4}$, which is reduced to $10^{-5}$ for the last
10 epochs. To ensure fairness, we use batch sizes of 12 and 6 for the single-task
model and the two tasks of the multi-task model, respectively. Moreover, gradient
scaling (5) is applied at all connections between the encoder and the decoder. For
further training details the interested reader is referred to [KTMFs20].
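The step-wise learning rate schedule can be written as a small helper. The function name, the parameter names, and the 0-based epoch counting are assumptions of this sketch; the numerical values are those given in the text.

```python
def learning_rate(epoch, total_epochs=40, base_lr=1e-4, final_lr=1e-5,
                  final_epochs=10):
    """Learning rate for a 0-based epoch index: the base rate is used
    until the last `final_epochs` epochs, then the reduced rate."""
    if epoch >= total_epochs - final_epochs:
        return final_lr
    return base_lr


schedule = [learning_rate(e) for e in range(40)]
print(schedule[0], schedule[29], schedule[30], schedule[39])
```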


5 Experimental Results and Discussion 


In this section, we will first analyze the multi-task learning method w.r.t. the achieved 
absolute performance, where we will put the focus on how the performance can be 
improved without extensive hyperparameter tuning. Afterwards, the resulting models 
will be analyzed w.r.t. robustness against common image corruptions such as random 
noise or adversarial attacks. 


Absolute performance: While the main focus of this chapter is the improved robustness
of semantic segmentation DNNs, this robustness improvement should not come
at the price of a significantly decreased absolute performance. In [KBFs20] it was
proposed to scale the gradients using a manually tunable hyperparameter (GW). However,
the results from Table 2 show that the absolute performance can decrease significantly,
e.g., for GW ($\lambda = 0.9$) on the KITTI dataset (right column in both tables),
compared to the single-task baseline. As there is no general way to know which task
weighting is optimal, we propose to use the GradNorm (GN) technique [CBLR18]
instead of manual gradient weighting (GW). For all considered GN hyperparameters
$\alpha$ and on both dataset settings, the GradNorm technique improves the absolute
performance over the single-task baseline. Interestingly, for this task weighting
strategy, the task weights change over the course of the learning process. In particular,
side experiments showed that the final task weights at the end of the training process
yield a decreased performance if they are kept constant throughout the whole
training process. This shows the importance of adapting them in dependence of the
task-specific training progress. Still, optimal performance is sometimes reached with
manual gradient weighting (e.g., Table 2a). For robustness purposes, however, an
optimal absolute performance is not as important as a stable convergence of the
multi-task training. We therefore use the GradNorm technique (GN) instead of the
manual gradient weighting for all following experiments w.r.t. DNN robustness.


Table 2 Absolute segmentation performance measured by the mIoU [%] metric for models where
the segmentation is trained on Cityscapes ($\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$) and the auxiliary depth estimation is either trained
on KITTI ($\mathcal{X}^{\mathrm{KIT}}_{\mathrm{train}}$, top) or Cityscapes ($\mathcal{X}^{\mathrm{CS,seq}}_{\mathrm{train}}$, bottom). We report numbers on the Cityscapes ($\mathcal{X}^{\mathrm{CS}}_{\mathrm{val}}$)
and KITTI ($\mathcal{X}^{\mathrm{KIT}}_{\mathrm{val}}$) validation sets for manual gradient weighting (GW) and the GradNorm (GN)
multi-task training method. Best numbers are in boldface. Second best numbers are underlined

(a) Segmentation: $\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$, depth: $\mathcal{X}^{\mathrm{KIT}}_{\mathrm{train}}$

Method                  mIoU on $\mathcal{X}^{\mathrm{CS}}_{\mathrm{val}}$    mIoU on $\mathcal{X}^{\mathrm{KIT}}_{\mathrm{val}}$
Baseline                63.5                  43.0
GW ($\lambda$ = 0.1)    67.4                  49.6
GW ($\lambda$ = 0.2)    68.9                  47.7
GW ($\lambda$ = 0.5)    67.4                  34.7
GW ($\lambda$ = 0.9)    67.7                  29.8
GN ($\alpha$ = 0.2)     66.8                  44.0
GN ($\alpha$ = 0.5)     67.0                  47.0
GN ($\alpha$ = 1.0)     66.4                  48.5
GN ($\alpha$ = 2.0)     65.7                  45.6

(b) Segmentation: $\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$, depth: $\mathcal{X}^{\mathrm{CS,seq}}_{\mathrm{train}}$

Method                  mIoU on $\mathcal{X}^{\mathrm{CS}}_{\mathrm{val}}$    mIoU on $\mathcal{X}^{\mathrm{KIT}}_{\mathrm{val}}$
Baseline                63.5                  43.0
GW ($\lambda$ = 0.1)    65.8                  47.2
GW ($\lambda$ = 0.2)    66.6                  46.0
GW ($\lambda$ = 0.5)    65.1                  44.9
GW ($\lambda$ = 0.9)    66.1                  40.5
GN ($\alpha$ = 0.2)     66.1                  46.1
GN ($\alpha$ = 0.5)     67.9                  48.3
GN ($\alpha$ = 1.0)     67.1                  45.5
GN ($\alpha$ = 2.0)     66.5                  48.3


Robustness to input noise: As a simple next experiment, we compared the robustness
w.r.t. Gaussian input noise between the baseline trained in a single-task fashion and
models trained in a multi-task fashion with the GradNorm method. The results in
Table 3a are obtained for a setting where the segmentation and depth tasks are trained
on Cityscapes and KITTI, respectively. We clearly observe that all multi-task models
(GN variants) exhibit a higher robustness measured by the mIoU ratio $Q$ (11), e.g.,
$Q = 27.5\%$ to $Q = 33.3\%$ for a perturbation strength of $\epsilon = 16$, compared to the
single-task baseline ($Q = 18.4\%$). The model variant GN ($\alpha = 0.5$) is furthermore
shown in Fig. 4. When looking at the curves for the Cityscapes and KITTI validation
sets and for both Gaussian and salt and pepper noise, we observe that the multi-task
model is consistently either on par or better w.r.t. robustness, measured by the mIoU
ratio $Q$.

Table 3 Robustness to Gaussian input noise measured by the mIoU ratio $Q$ [%] (11) for models
where the segmentation is trained on Cityscapes ($\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$) and the auxiliary depth estimation is either
trained on KITTI ($\mathcal{X}^{\mathrm{KIT}}_{\mathrm{train}}$, top) or Cityscapes ($\mathcal{X}^{\mathrm{CS,seq}}_{\mathrm{train}}$, bottom). Best numbers are in boldface

(a) Segmentation: $\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$, depth: $\mathcal{X}^{\mathrm{KIT}}_{\mathrm{train}}$

$\epsilon$              0.25    0.5     1      2      4      8      16
Baseline                100.0   99.7    98.6   93.7   75.1   44.7   18.4
GN ($\alpha$ = 0.2)     100.0   99.9    99.0   96.1   85.5   60.8   28.1
GN ($\alpha$ = 0.5)     100.0   99.9    99.6   97.3   87.2   61.9   29.7
GN ($\alpha$ = 1.0)     100.0   100.0   99.5   97.1   88.1   65.4   33.3
GN ($\alpha$ = 2.0)     100.0   99.8    99.2   96.2   86.2   60.7   27.5

(b) Segmentation: $\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$, depth: $\mathcal{X}^{\mathrm{CS,seq}}_{\mathrm{train}}$

$\epsilon$              0.25    0.5     1      2      4      8      16
Baseline                100.0   99.7    98.6   93.7   75.1   44.7   18.4
GN ($\alpha$ = 0.2)     100.0   99.8    99.4   96.8   82.0   45.4   13.0
GN ($\alpha$ = 0.5)     99.9    99.7    99.1   95.3   79.4   41.8   10.2
GN ($\alpha$ = 1.0)     100.0   99.9    99.4   96.3   82.4   42.5   15.6
GN ($\alpha$ = 2.0)     99.8    99.7    99.1   96.2   82.0   44.5   8.0


A slightly different behavior is observed when the auxiliary depth estimation
task is trained on the same dataset as the semantic segmentation task, as shown
in Table 3b. Here, for small perturbation strengths ($\epsilon \leq 4$), the robustness is still
improved; however, for larger perturbation strengths, the robustness is either similar
or even decreased. This can be interpreted as an indication that the additional data
from another domain is mainly responsible for the robustness improvement rather
than the additional task itself. However, the self-supervised training technique of the
auxiliary task is the precondition for being able to make use of additional unlabeled
data.


Robustness to adversarial attacks: In addition to simple noise conditions, we also
investigate adversarial attacks, where the noise pattern is optimized to fool the network.
In Table 4 we show results on robustness w.r.t. the FGSM adversarial attack. We
again observe that when the auxiliary task is trained in a different domain (Table 4a),
the robustness is significantly improved regardless of the perturbation strength.
In contrast to simple noise perturbations, the FGSM attack can degrade performance
even for very small and visually hard-to-perceive perturbation strengths ($\epsilon < 1$), for
which robustness is still improved by our method. Moreover, we again observe that
the robustness improvement is smaller when the auxiliary depth task is trained
in the same domain (Table 4b). For instance, for $\epsilon = 8$ the robustness improves
from $Q = 9.6\%$ to $Q = 33.4\%$ for the best multi-task model when the depth is trained
out-of-domain, while for in-domain training the improvement is only from $Q = 9.6\%$
to $Q = 21.8\%$. Still, for the FGSM attack all multi-task models improve upon the
baseline in terms of robustness, showing the general applicability of the GradNorm
multi-task learning technique for robustness improvement.


Fig. 4 Robustness to several input perturbation types for models where the segmentation is trained
on Cityscapes ($\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$) and the auxiliary depth estimation on KITTI ($\mathcal{X}^{\mathrm{KIT}}_{\mathrm{train}}$). We report mIoU ratios
$Q$ [%] (11) on the Cityscapes ($\mathcal{X}^{\mathrm{CS}}_{\mathrm{val}}$) and KITTI ($\mathcal{X}^{\mathrm{KIT}}_{\mathrm{val}}$) validation sets for the GradNorm ($\alpha = 0.5$)
multi-task training method

To also investigate this effect on a wider range of datasets and perturbations,
we show robustness results on the Cityscapes and KITTI validation sets and for
the FGSM and PGD adversarial attack in Fig. 4, bottom. For all considered cases,
the robustness of the multi-task model GN ($\alpha = 0.5$) improves upon the single-task
baseline. We also show qualitative results in Fig. 5 for Gaussian noise ($\epsilon = 8.0$) and
the FGSM attack ($\epsilon = 4.0$) for the GN model ($\alpha = 0.5$), where we also qualitatively
observe a significant improvement over the single-task baseline.


Table 4 Robustness to the FGSM adversarial attack measured by the mIoU ratio $Q$ [%] (11) for
models where the segmentation is trained on Cityscapes ($\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$) and the auxiliary depth estimation
is either trained on KITTI ($\mathcal{X}^{\mathrm{KIT}}_{\mathrm{train}}$, top) or Cityscapes ($\mathcal{X}^{\mathrm{CS,seq}}_{\mathrm{train}}$, bottom). Best numbers are in boldface

(a) Segmentation: $\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$, depth: $\mathcal{X}^{\mathrm{KIT}}_{\mathrm{train}}$

$\epsilon$              0.25   0.5    1      2      4      8      16
Baseline                63.8   57.0   51.0   43.9   29.6   9.6    2.4
GN ($\alpha$ = 0.2)     73.0   67.2   62.3   58.7   51.2   32.6   14.4
GN ($\alpha$ = 0.5)     70.0   63.4   57.8   53.1   46.4   30.4   12.4
GN ($\alpha$ = 1.0)     70.3   63.9   59.3   55.8   49.2   33.4   13.3
GN ($\alpha$ = 2.0)     72.1   66.1   61.5   57.5   44.5   31.1   14.6

(b) Segmentation: $\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$, depth: $\mathcal{X}^{\mathrm{CS,seq}}_{\mathrm{train}}$

$\epsilon$              0.25   0.5    1      2      4      8      16
Baseline                63.8   57.0   51.0   43.9   29.6   9.6    2.4
GN ($\alpha$ = 0.2)     71.4   65.7   61.0   57.5   46.9   21.8   7.7
GN ($\alpha$ = 0.5)     74.3   66.0   61.1   56.7   44.5   17.7   5.6
GN ($\alpha$ = 1.0)     72.6   66.3   61.5   57.2   40.8   16.4   7.7
GN ($\alpha$ = 2.0)     72.2   66.0   61.5   57.1   44.5   16.2   2.9


[Figure panels. Columns: Gaussian noise ($\epsilon = 8.0$), FGSM adversarial attack ($\epsilon = 4.0$). Rows: input image, single-task, multi-task]

Fig. 5 Qualitative result comparison for models where the segmentation is trained on Cityscapes
($\mathcal{X}^{\mathrm{CS}}_{\mathrm{train}}$) and the auxiliary depth estimation on KITTI ($\mathcal{X}^{\mathrm{KIT}}_{\mathrm{train}}$). We show results for two different
perturbations and compare between the single-task baseline and the GradNorm ($\alpha = 0.5$)
multi-task training method


6 Conclusions


In this chapter, we show how the robustness of a semantic segmentation DNN can be 
improved by multi-task training with the auxiliary task of depth estimation, which 
can be trained in a self-supervised fashion on arbitrary videos. We show this robust- 
ness improvement across two noise perturbations and two adversarial attacks, where 
we ensure comparability of different perturbations in terms of strength. By making 
use of the GradNorm task weighting strategy, we are able to remove the necessity 
for manual tuning of hyperparameters, thereby achieving a stable and easy-to-apply 
robustness improvement. Also, we show that in-domain training with the additional 
task of depth estimation already improves robustness to some degree, while out-of- 
domain training on additional unlabeled data enabled by the self-supervised training 
improves robustness even further. Moreover, our method is easy to apply, induces no 
computational overhead during inference, and even improves absolute performance, 
which can be of interest for applications such as autonomous driving, virtual reality, 
or medical imaging as long as additional video data is available. Our method thereby 
demonstrates various further advantages of multi-task training for semantic segmen- 
tation, which could be potentially generalizable to various further computer vision 
tasks. 
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they are vulnerable to adversarial perturbations. Recent works have proven the exis- 
tence of universal adversarial perturbations (UAPs), which, when added to most 
images, destroy the output of the respective perception function. Existing attack 
methods often show a low success rate when attacking target models which are 
different from the one that the attack was optimized on. To address such weak trans- 
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model architectures, with a cross-entropy loss. Experimental results on ImageNet and 
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1 Introduction 


Reaching desired safety standards quickly is of utmost importance for automated
vehicles (AVs), in particular, for their environment perception module. Recent
advances in deep neural networks (DNNs) have given researchers the means to
solve real-world problems, specifically improving the state of the art in
environment perception [GTCM20, NVM+19]. These networks can help AVs in
understanding the environment, such as identifying traffic signs and detecting surrounding
objects, by incorporating many sensors (e.g., camera, LiDAR, and RaDAR) to build
an overall representation of the environment [WLH+20, FJGN20], or even to provide
an end-to-end control for the vehicle [KBJ+20], see Fig. 1.

A large body of studies has addressed adversarial attacks [SZS+14, MDFF+18] 
and robustness enhancement [GRM+19, SOZ+18] of DNN architectures. In AVs, it 
was shown that adversarial signs could fool a commercial classification system in 
real-world driving conditions [MKMW19]. In addition to the vulnerability of image
classifiers, Arnab et al. [AMT18] extensively analyzed the behavior of semantic
segmentation architectures for AVs and illustrated their susceptibility to adversarial
attacks. To reach a high level of automation, as defined by the SAE standard J3016
[SAE18], car manufacturers have to consider various threats and perturbations
targeting the AV systems.

In vision-related tasks, adversarial perturbations can be divided into two types: 
image-dependent (per-instance) adversarial perturbations and universal adversarial 
perturbations (UAPs). In image-dependent adversarial perturbations, a specific opti- 
mization has to be performed for each image individually to generate an adversarial 
example. In contrast, UAPs are more general perturbations in the sense that a single
additive perturbation turns all the images in a dataset into adversarial
examples. Therefore, they are much more efficient in terms of computation cost and
time when compared to image-dependent adversarial attacks [CAB+20]. However,
while showing a high success rate on the models they are optimized on, they lack
transferability to other models.

In this chapter, we aim to specifically address the vulnerability of image classi- 
fication and semantic segmentation systems as cornerstones of visual perception in 
automated vehicles. In particular, we aim at improving the transferability of task- 
specific universal adversarial perturbations. Our major contributions are as follows: 

First, we present a comprehensive similarity analysis of features from several 
layers of different DNN classifiers. 

Second, based on these findings, we propose a novel fooling loss function for
generating universal adversarial perturbations. In particular, we combine the fast
feature fool loss [MGR19], restricted to the first layer only, with the cross-entropy
loss to train an attack generator with the help of a source model for generating
targeted and non-targeted UAPs for any other target model.

Third, we show that our UAPs not only exhibit remarkable transferability across
multiple networks trained on the same dataset, but can also be generated using a
reduced subset of the training dataset, while still having a satisfactory generalization
power over unseen data.

Fourth, using our method, we are able to surpass state-of-the-art performance in 
both white-box and black-box settings. 

Finally, by extensive evaluation of the generated UAPs on various image clas- 
sification and semantic segmentation models, we demonstrate that our approach is 
generalizable across multiple vision tasks. 

The remainder of this chapter is organized as follows. In Sect. 2, we present 
some background and related works. The proposed method, along with mathematical 
notations, is introduced in Sect. 3. Experimental results on image classification and 
semantic segmentation tasks are then presented in Sects. 4 and 5, respectively. Finally, 
in Sect. 6 we conclude the chapter. 


2 Related Works 


2.1 Environment Perception for Automated Vehicles 


Figure 1 gives an overview of the major components comprised within an automated 
vehicle: Environment perception, motion planning, and motion control [GR20]. The 
environment perception module collects data using several sensors to obtain an over- 
all representation of the surroundings. This information is then processed through 
the motion planning module to calculate a reasonable trajectory, which is executed 
via the motion control module at the end. In this chapter, we concentrate on the 
environment perception of automated vehicles. 

The environment perception of automated vehicles comprises several sensors,
including radio detection and ranging (RaDAR), light detection and ranging (LiDAR),
and camera [RT19]. RaDAR sensors supplement camera vision in times of low
visibility, e.g., night driving, and are able to improve the detection accuracy [PTWA17].
LiDAR sensors are commonly used to make high-resolution maps, and they are
capable of detecting obstacles [WXZ20]. The recent rise of deep neural networks has
drawn strong interest to camera-based environment perception solutions, e.g., object
detection and classification [ZWJW20], as well as semantic segmentation [RABA18].

Fig. 1 Environment perception and subsequent functions in an automated vehicle (AV), acc. to

In this chapter, we focus on camera-based environment perception with deep 
neural networks for image classification as well as for semantic segmentation in the 
context of AVs. 


2.2 Adversarial Attacks 


It is well known that deep neural networks are vulnerable to adversarial 
attacks [SZS+14]. While adversarial attacks for image classification have been widely 
studied, the influence of these attacks in other applications, such as semantic segmen- 
tation, has rarely been investigated [AM18, BHSFs19, BZIK20, BKV+20, KBFs20, 
ZBIK20a, BLK+21]. In this section, we review existing attack methods in both clas- 
sification and semantic segmentation perception tasks. 

There are several possible ways of categorizing an adversarial attack, e.g., 
targeted/non-targeted attack, and white-box/black-box attack [HM19]. In a targeted 
attack, the adversary generates an adversarial example to change the system’s output 
to a maliciously chosen target label. In a non-targeted attack, on the other hand, 
the attacker wants to direct the result to a label that is different from the ground 
truth, no matter what it is. It should be noted that in semantic segmentation, targeted 
attacks can be divided into static and dynamic target segmentation [MCKBF17]. 
Static attacks are similar to the targeted attacks in image classification. Here, the
aim is to force the model to always output the same target semantic segmentation
mask. The dynamic attack follows the objective of replacing all pixels classified as
a previously chosen target class by their spatially nearest neighbor class. Another way of
attack categorization is based on the attackers’ knowledge of the parameters and the 
architecture of the target model. While a white-box scenario refers to the case when 
an attacker has full knowledge about the underlying model, a black-box scenario 
implies that no information about the respective model is available. We address both 
targeted and non-targeted attacks as well as white-box and black-box scenarios in 
this chapter. 

In the following, we will investigate two types of adversarial perturbations,
image-dependent and image-agnostic, in classification and semantic segmentation
models. We will also review the problem of transferability in adversarial attacks.


Image-dependent adversarial perturbations: There are various iterative, non- 
iterative, and generative methods for crafting image-dependent adversarial exam- 
ples. Goodfellow et al. [GSS15] introduced the fast gradient sign method (FGSM), 
one of the first adversarial attacks. FGSM computes the gradient of a loss J of the
source model S with respect to the (image) input x to create an adversarial
example. Iterative FGSM (I-FGSM) [KGB17] iteratively applies FGSM with a small
step size, while momentum iterative FGSM [DLP+18] utilizes a momentum-based
optimization algorithm for stronger adversarial attacks. Another iterative attack is
projected gradient descent (PGD) [MMS+18], where the main difference to I-FGSM
is random restarts in the optimization process. Kurakin et al. [KGB17] proposed the 
least-likely class method (LLCM), in which the target is set to the least-likely class 
predicted by the network. Xiao et al. [XZL+18] introduced the spatial transform 
attack (STM) for generating adversarial examples. In STM, instead of changing the 
pixel values in a direct manner, spatial filters are employed to substitute pixel val- 
ues maliciously. Also, some works [HM19] have employed a generative adversarial 
network (GAN) [GPAM+14] for generating adversarial examples. 
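
The single-step and iterative gradient sign attacks described above can be sketched on a toy model. The following minimal example is an illustration under stated assumptions, not the setup used in the cited works: the "source model" is a hypothetical three-input logistic classifier, the loss J is the binary cross-entropy, pixel values are assumed to lie in [0, 1], and all function names are chosen for this sketch only.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss_and_input_grad(w, b, x, label):
    """Binary cross-entropy J of a toy logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -math.log(p) if label == 1 else -math.log(1.0 - p)
    grad = [(p - label) * wi for wi in w]  # dJ/dx for the logistic model
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo=0.0, hi=1.0):
    return max(lo, min(hi, v))

def fgsm(w, b, x, label, eps):
    """Single step: x_adv = clip(x + eps * sign(grad_x J))."""
    _, g = loss_and_input_grad(w, b, x, label)
    return [clip(xi + eps * sign(gi)) for xi, gi in zip(x, g)]

def iterative_fgsm(w, b, x, label, eps, step, n_steps):
    """Repeated small FGSM steps, each projected back into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(n_steps):
        _, g = loss_and_input_grad(w, b, x_adv, label)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [clip(min(max(xa, xi - eps), xi + eps)) for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical toy parameters and input (not taken from the chapter).
w, b = [2.0, -3.0, 1.0], 0.1
x, label = [0.8, 0.2, 0.5], 1

clean_loss, _ = loss_and_input_grad(w, b, x, label)
x_fgsm = fgsm(w, b, x, label, eps=0.1)
x_ifgsm = iterative_fgsm(w, b, x, label, eps=0.1, step=0.025, n_steps=4)
fgsm_loss, _ = loss_and_input_grad(w, b, x_fgsm, label)
ifgsm_loss, _ = loss_and_input_grad(w, b, x_ifgsm, label)
```

Both variants raise the loss while keeping every pixel within eps of the clean input. For this linear toy model the two attacks end at practically the same point; on deep networks the iterative variant typically finds stronger perturbations.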

A basic analysis on the behavior of semantic segmentation DNNs against adver- 
sarial examples was performed by Arnab et al. [AMT18]. They applied three com- 
monly known adversarial attacks for image classification tasks, i.e., FGSM [GSS15], 
Iterative-FGSM [KGB17], and LLCM [KGB17], to semantic segmentation models 
and illustrated the vulnerability of this task. There are also some works that concen- 
trate on sophisticated adversarial attacks that lead to more reasonable outcomes in 
the context of semantic segmentation [XDL+18, PKGB18]. 


Universal adversarial perturbations: Universal adversarial perturbations (UAPs)
were first introduced by Moosavi-Dezfooli et al. as image-agnostic perturbations
[MDFFF17]. Similar to image-dependent adversarial attacks, there are iterative,
non-iterative, and generative techniques for creating UAPs. An iterative
algorithm to generate UAPs for an image classifier was presented in [MDFF+18]. The
authors provided a geometric analysis of the decision boundary in DNNs
and proved the existence of small universal adversarial perturbations.
Some researchers focused on generative models that can be trained for generating
UAPs [HD18, RMOGVB18]. Mopuri et al. presented a network for adversary
generation (NAG) [RMOGVB18], which builds upon GANs. NAG utilizes fooling and
diversity loss functions to model the distribution of UAPs for a DNN image classifier.
Moreover, Poursaeed et al. [PKGB18] introduced the generative adversarial
perturbation (GAP) algorithm for transforming noise drawn from a uniform distribution to
adversarial perturbations to conduct adversarial attacks in classification and semantic
segmentation tasks. Metzen et al. [MCKBF17] proposed an iterative algorithm for
semantic segmentation, which led to more realistic-looking false segmentation
masks.

Contrary to the previous data-dependent methods, Mopuri et al. [MGR19] introduced
fast feature fool (FFF), a data-independent algorithm for producing
non-targeted UAPs. FFF aims at injecting maximal adversarial energy into each layer of
the source model S. This is done by the following loss function:

$$J^{\mathrm{FFF}}(r) = \sum_{\ell=1}^{L} J^{\mathrm{FFF}}_{\ell}(r), \qquad J^{\mathrm{FFF}}_{\ell}(r) = -\log\left(\|A_{\ell}(r)\|_2\right), \tag{1}$$

where $A_{\ell}(r)$ is the mean of all feature maps of the $\ell$-th layer (after the activation
function in layer $\ell$), when only the UAP r is fed into the model. Note that usually
$x^{\mathrm{adv}} = x + r$, with clean image x, is fed into the model. This algorithm starts with
a random r which is then iteratively optimized. For mitigating the absence of data
in producing UAPs, Mopuri et al. [MUR18] proposed class impressions (CIs), a
form of reconstructed images that are obtained via simple optimization from the
source model. After finding multiple CIs in the input space for each target class,
they trained a generator to create UAPs. Recently, Zhang et al. [ZBIK20b] proposed
a targeted UAP algorithm using random source images (TUAP-RSI) from a proxy
dataset instead of the original training dataset.
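
The loss in (1) can be made concrete with a small sketch. The code below is a minimal illustration, assuming a hypothetical toy network made of pointwise (1x1) convolutions with ReLU activations and feature maps stored as flattened lists; the layer weights, the helper names, and the `num_layers` argument are assumptions of this sketch, not part of [MGR19].

```python
import math

def relu(v):
    return v if v > 0.0 else 0.0

def pointwise_conv_relu(channels, weights, biases):
    """Toy 1x1 convolution followed by ReLU; channels is a list of flattened feature maps."""
    n_pos = len(channels[0])
    out = []
    for w_row, b in zip(weights, biases):
        out.append([
            relu(sum(w * ch[p] for w, ch in zip(w_row, channels)) + b)
            for p in range(n_pos)
        ])
    return out

def mean_feature_map(channels):
    """A_l: mean over all feature maps (channels) of one layer."""
    n = len(channels)
    return [sum(ch[p] for ch in channels) / n for p in range(len(channels[0]))]

def l2_norm(v):
    return math.sqrt(sum(t * t for t in v))

def fff_loss(r, layers, num_layers=None):
    """Sum over the first num_layers layers of -log(||A_l(r)||_2), cf. (1)."""
    if num_layers is None:
        num_layers = len(layers)
    total, act = 0.0, r
    for weights, biases in layers[:num_layers]:
        act = pointwise_conv_relu(act, weights, biases)
        total += -math.log(l2_norm(mean_feature_map(act)))
    return total

# Hypothetical two-layer toy network and two perturbations with 3 channels and 4 pixels.
layers = [
    ([[1.0, 0.5, -0.2], [0.3, 0.8, 0.1]], [0.0, 0.0]),
    ([[0.5, 0.5], [1.0, -0.5]], [0.0, 0.0]),
]
r_small = [[0.01] * 4 for _ in range(3)]
r_large = [[0.04] * 4 for _ in range(3)]

loss_small = fff_loss(r_small, layers)
loss_large = fff_loss(r_large, layers)
loss_first_layer_only = fff_loss(r_small, layers, num_layers=1)
```

In this toy setting the larger perturbation produces stronger activations and therefore a lower loss, which is the sense in which minimizing (1) injects adversarial energy into the layers. Setting `num_layers=1` restricts the sum to the first layer.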

In this chapter, we follow Poursaeed et al. [PKGB18] by proposing an efficient 
generative approach that focuses on propagating adversarial energy in a source model 
to generate UAPs for the task of image classification and semantic segmentation. 


Transferability in black-box attacks: The ability of an adversarial example to be 
effective against a different, potentially unknown, target model is known as trans- 
ferability. Researchers have evaluated the transferability of adversarial examples on 
image classifiers [MGR19, MDFFF17, PXL+20, LBX+20] and semantic segmen- 
tation networks [PKGB18, AMT18]. 

Regarding the philosophy behind transferability, Goodfellow et al. [GSS15] 
demonstrated that estimating the size of adversarial subspaces is relevant to the 
transferability issue. Another potential perspective lies in the similarity of decision 
boundaries. Learning substitute models, approximating the decision boundaries of
target models, is a well-known approach to attack an unknown model [PMJ+16]. Wu et
al. [WWX+20] considered DNNs with skip connections and found that using more 
gradients from the skip connections, rather than the residual modules, allows the 
attacker to craft more transferable adversarial examples. Wei et al. [WLCC18] pro- 
posed to manipulate feature maps, extracted by a separate feature network, to create 
more transferable image-dependent perturbations using a GAN. Li et al. [LBZ+20] 
introduced a virtual model known as a ghost network to apply feature-level perturba- 
tions to an existing model to produce a large set of diverse models. They showed that 
ghost networks, together with a coupled ensemble strategy, improve the transferabil- 
ity of existing techniques. Wu et al. [WZTE18] empirically investigated the depen- 
dence of adversarial transferability on model-specific attributes, including model 
capacity, architecture, and test accuracy. They demonstrated that fooling rates heav- 
ily depend on the similarity of the source model and target model architectures. 

In this chapter, we increase the transferability of generated UAPs by including a
loss term inspired by Mopuri et al. [MGR19] focusing on the adversarial energy in
early layers.


3 Method


3.1 Mathematical Notations 


We first introduce notations for image classification and semantic segmentation in a 
natural domain without attacks and then extend this to the adversarial domain. 


Natural Domain: Consider a classifier S trained on a training set $\mathcal{X}^{\mathrm{train}}$ having
M different classes. This network assigns a label $m = S(x) \in \mathcal{M} = \{1, ..., M\}$ to
each input image x from training set $\mathcal{X}^{\mathrm{train}}$ or test set $\mathcal{X}^{\mathrm{test}}$. We assume that image
$x \in \mathbb{I}^{H \times W \times C}$ is a clean image, meaning that it contains no adversarial perturbations,
with height H, width W, C = 3 color maps, and $\mathbb{I} = [0, 1]$. Each image x is tagged
with a ground truth label $\overline{m} \in \mathcal{M}$.

In semantic segmentation, given an input image x, we assign each pixel
a class label. In this case, the semantic segmentation network S outputs a label
map $m = S(x) \in \mathcal{M}^{I}$ for each input image $x = (x_1, ..., x_i, ..., x_I)$, with $I = H \times W$.
Similar to before, each pixel $x_i \in \mathbb{I}^{C}$ of an image x is tagged with a ground truth label
$\overline{m}_i \in \mathcal{M}$, resulting in the ground truth label map $\overline{m}$ for the entire image x.


Adversarial domain: Let $x^{\mathrm{adv}}$ be an adversarial example that belongs to the
adversarial space of the network; for example, for the network S this space is defined
as $\mathcal{X}^{\mathrm{adv}}_{S}$. In order to have a quasi-imperceptible perturbation r when added to clean
images to obtain adversarial examples, i.e., $x^{\mathrm{adv}} = x + r$, it is bounded by $\|r\|_p \le \epsilon$,
with $\epsilon$ being the supremum of a respective p-norm $\|\cdot\|_p$.

In case of classification networks, for each $x^{\mathrm{adv}} \in \mathcal{X}^{\mathrm{adv}}_{S}$, the purpose of
non-targeted attacks is to obtain $S(x^{\mathrm{adv}}) \neq \overline{m}$. In targeted attacks, the attacker tries to
enforce $S(x^{\mathrm{adv}}) = \mathring{m} \neq \overline{m}$, where $\mathring{m}$ denotes the target class the attacker aims at.

In case of semantic segmentation networks, for each $x^{\mathrm{adv}} \in \mathcal{X}^{\mathrm{adv}}_{S}$ in non-targeted
attacks we aim at $S(x^{\mathrm{adv}}) = m = (m_i)$ with $m_i \neq \overline{m}_i$ for all $i \in \mathcal{I} = \{1, ..., I\}$. On
the other hand, in static targeted attacks, the goal is $S(x^{\mathrm{adv}}) = \mathring{m} \neq \overline{m}$, where $\mathring{m}$
denotes the target mask of the attacker.

Also, let T be the target model under attack, which is a deep neural network 
(DNN) with given (frozen) parameters, pretrained on some datasets to perform a 
specific environment perception task. We define z as a random variable sampled 
from a distribution, which is fed to a UAP generator G to produce a perturbation 
r = G(z). Also, J stands for a loss function, which is differentiable with respect to 
the network parameters and the input. 


3.2 Method Introduction and Initial Analysis 


In order to improve the robustness of DNNs for environment perception, knowledge
of sophisticated image-agnostic adversarial attacks that work in both white-box and
black-box settings is needed.

Fig. 2 Proposed training approach for a UAP generator for non-targeted attacks (label $\overline{y}$) and
targeted attacks ($\mathring{y}$)


In this section, we present our proposed approach to generate UAPs. It builds
upon the adversarial perturbation generator proposed by Poursaeed et al. [PKGB18].
Unlike [PKGB18], we focus on a fooling loss function for generating effective UAPs
in both white-box and black-box scenarios. We employ a pretrained DNN as the
source model S, which is exposed to UAPs during training the UAP generator. Our
goal is to find a UAP r by an appropriate loss function, which is able to not only
deceive the source model S on a training or test dataset $\mathcal{X}^{\mathrm{train}}$ or $\mathcal{X}^{\mathrm{test}}$, respectively,
but also to effectively deceive a target model T, for which $T \neq S$ holds.

Figure 2 gives an overview of our UAP generator training methodology. Let
$G(z) = r$ be the UAP generator function mapping an unstructured, random
multi-dimensional input $z \sim \mathbb{U}^{H \times W \times C}$, sampled from a prior uniform distribution $\mathbb{U}$ on $\mathbb{I}$,
onto a perturbation $r \in \mathbb{I}^{H \times W \times C}$. To obtain a p-norm scaled r, a preliminarily
obtained perturbation $r'$ is bounded by multiplying the generator network raw
output $r' = G'(z)$ with $\min\left(1, \frac{\epsilon}{\|r'\|_p}\right)$. Next, the resulting adversarial perturbation r
is added to an image $x \in \mathcal{X}^{\mathrm{train}}$ before being clipped to a valid range of RGB image
pixel values, resulting in an adversarial example $x^{\mathrm{adv}}$. Finally, the generated
adversarial example $x^{\mathrm{adv}}$ is fed to a pretrained source model S to compute the adversarial
loss functions based on targeted or non-targeted attack types.
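
The scaling and clipping steps of this pipeline can be sketched in a few lines. The example below is a minimal illustration under assumptions: images and perturbations are flattened lists, the valid pixel range is taken to be [0, 255] (which matches bounds such as $\epsilon = 10$ used later in the chapter), and the function names are hypothetical.

```python
import math

def lp_norm(r, p):
    """p-norm of a flattened perturbation; p may be math.inf."""
    if p == math.inf:
        return max(abs(v) for v in r)
    return sum(abs(v) ** p for v in r) ** (1.0 / p)

def scale_perturbation(r_raw, eps, p):
    """r = r' * min(1, eps / ||r'||_p), so that ||r||_p <= eps."""
    norm = lp_norm(r_raw, p)
    factor = 1.0 if norm == 0.0 else min(1.0, eps / norm)
    return [factor * v for v in r_raw]

def make_adversarial(x, r, lo=0.0, hi=255.0):
    """x_adv = clip(x + r) to the assumed valid pixel range [lo, hi]."""
    return [min(hi, max(lo, xi + ri)) for xi, ri in zip(x, r)]

# Hypothetical raw generator output and image (four "pixels").
r_raw = [30.0, -40.0, 5.0, 0.0]
x = [250.0, 5.0, 100.0, 0.0]

r_inf = scale_perturbation(r_raw, eps=10.0, p=math.inf)  # scaled by 10 / 40
r_l2 = scale_perturbation(r_raw, eps=2000.0, p=2)        # already inside the L2 ball
x_adv = make_adversarial(x, r_inf)
```

Because the raw output is only ever scaled down, a perturbation that already satisfies the norm bound passes through unchanged, while the subsequent clipping keeps the adversarial example a valid image.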

To increase the model transferability of the generated UAPs, we seek
similarities between different pretrained DNNs to take advantage of this property. For
this, we selected some state-of-the-art classifiers such as VGG-16, VGG-19 [SZ15],
ResNet-18, and ResNet-152 [HZRS16], all pretrained on ImageNet [RDS+15],
to investigate their extracted feature maps in different levels. We first measure
the similarity of the mean feature maps of a layer between all networks over the entire
ImageNet [RDS+15] validation set, using the well-known and universally applicable
mean squared error (MSE).¹ Figure 3 displays the resulting heat maps. In addition,
Fig. 4 shows the mean of feature representations $A_{\ell}(x)$ for these four pretrained
classifiers computed for layer $\ell = 1$ up to $\ell = 6$ (after each activation function) for a
selected input image x. Both figures, Figs. 3 and 4, show that the respective networks
share a qualitatively and quantitatively high similarity in the first layer compared to
all subsequent layers. Only for close relatives, such as VGG-16 and VGG-19, this
similarity is found in later layers as well. We thus hypothesize that by applying the
fast feature fool loss $J^{\mathrm{FFF}}_{1}$ (1) only to the first layer of the source model during
training, we not only inject high adversarial energy into the first layer but also increase
the transferability of the generated UAPs.

Fig. 3 Mean squared error (MSE) $\|A_{\ell}^{\mathrm{net1}}(x) - A_{\ell}^{\mathrm{net2}}(x)\|_2^2$ between the mean of feature
representations $A_{\ell}(x)$ in layers $\ell \in \{1, 2, 3, 4, 5, 6\}$ of different DNN classifiers pretrained on the ImageNet
dataset. The results are reported for images x from the ImageNet validation dataset. All networks
show a considerable similarity in terms of MSE in the first layer, whereas similarity in later layers
is only seen among VGG-16 and VGG-19 [SZ15]

Fig. 4 The layer-wise mean of feature representations $A_{\ell}(x)$ within different pretrained classifiers
(rows: VGG-16 [SZ15], VGG-19 [SZ15], ResNet-18 [HZRS16], ResNet-152 [HZRS16]),
computed for the first six layers (columns: $\ell = 1, ..., 6$) for a random input image x taken from
the ImageNet validation dataset. High similarity is observed in the first layers, in the later layers
only among the VGG-16 and the VGG-19 network

¹ We also evaluated the similarity of feature maps in different layers by the structural similarity index
(SSIM) [WBSS04] and peak signal to noise ratio (PSNR) [HZ10]. The results are very similar to
Fig. 3.
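
The similarity measurement behind this hypothesis can be sketched as follows. This is a minimal illustration with assumptions: the feature maps of the two networks are hypothetical toy values, they are stored as flattened lists, and both networks are assumed to deliver mean feature maps of the same spatial size (how differing resolutions are aligned is not specified here).

```python
def mean_feature_map(channels):
    """A_l(x): mean over all feature maps (channels) of one layer."""
    n = len(channels)
    return [sum(ch[p] for ch in channels) / n for p in range(len(channels[0]))]

def mse(a, b):
    """Mean squared error between two flattened maps of equal size."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

# Hypothetical first-layer activations of two networks for the same input image:
# the channel counts differ (2 vs. 3), the spatial size (4 pixels) is the same.
net1_layer1 = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
net2_layer1 = [[2.0, 2.0, 2.0, 2.0], [1.0, 3.0, 2.0, 2.0], [3.0, 1.0, 2.0, 5.0]]

a1 = mean_feature_map(net1_layer1)  # [2.0, 2.0, 2.0, 2.0]
a2 = mean_feature_map(net2_layer1)  # [2.0, 2.0, 2.0, 3.0]
similarity_error = mse(a1, a2)
```

Averaging over channels first makes layers with different channel counts comparable; a small MSE between the resulting maps indicates similar first-layer behavior.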

In the following, we formulate our fooling loss and consider the non-targeted and 
targeted attack cases (see Fig. 2). 


3.3 Non-targeted Perturbations 


In the non-targeted case, we want to fool the source model S so that its prediction
$S(x^{\mathrm{adv}})$ differs from the ground truth one-hot representation $\overline{y}$. The simplest and
most natural choice is to define the negative cross-entropy as the
fooling loss for non-targeted attacks, while Poursaeed et al. [PKGB18] proposed the
logarithm of this loss function.

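For image classification, we define the generator fooling loss for our non-targeted
attacks as

$$J^{\mathrm{adv,nontargeted}} = -\alpha \cdot J^{\mathrm{CE}}(S(x^{\mathrm{adv}}), \overline{y}) + (1 - \alpha) \cdot J^{\mathrm{FFF}}_{1}(x^{\mathrm{adv}}), \tag{2}$$

where $J^{\mathrm{CE}}$ denotes the cross-entropy loss, $\overline{y} = (\overline{y}_{\mu}) \in \{0, 1\}^{M}$ is the one-hot
encoding of the ground truth label $\overline{m}$ for image x, and $\mu$ is the class index, such that
$\overline{m} = \arg\max_{\mu \in \mathcal{M}} \overline{y}_{\mu}$. Also, let $J^{\mathrm{FFF}}_{1}(x^{\mathrm{adv}})$ be the fast feature fool loss of layer $\ell = 1$ (see
(1)), when $x^{\mathrm{adv}}$ is fed to the network S. Further, the cross-entropy loss is defined as

$$J^{\mathrm{CE}}(y, \overline{y}) = -\sum_{\mu \in \mathcal{M}} \overline{y}_{\mu} \log(y_{\mu}) = -\log(y_{\overline{m}}), \tag{3}$$

where $y = S(x^{\mathrm{adv}}) = (y_{\mu}) \in \mathbb{I}^{M}$ is the network output vector of S with the predictions
for each class $\mu$. For optimization, we utilize Adam [KB15] in standard configuration,
following [PKGB18].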
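
As a worked illustration of (2) and (3), the following minimal sketch evaluates the non-targeted fooling loss for two hypothetical softmax outputs. The probability vectors and the fixed first-layer FFF value are invented for this example, and the default $\alpha = 0.7$ is only a placeholder.

```python
import math

def cross_entropy(y, y_onehot):
    """J^CE(y, y_bar) = -sum_mu y_bar[mu] * log(y[mu]), cf. (3)."""
    return -sum(t * math.log(p) for p, t in zip(y, y_onehot) if t > 0.0)

def nontargeted_fooling_loss(y_adv, y_onehot, fff_first_layer, alpha=0.7):
    """(2): -alpha * J^CE(S(x_adv), y_bar) + (1 - alpha) * J^FFF_1(x_adv)."""
    return -alpha * cross_entropy(y_adv, y_onehot) + (1.0 - alpha) * fff_first_layer

gt_onehot = [1.0, 0.0, 0.0]    # ground truth class index 0
y_confident = [0.7, 0.2, 0.1]  # source model still predicts the ground truth
y_fooled = [0.1, 0.6, 0.3]     # probability mass moved away from the ground truth
fff_term = 0.5                 # hypothetical first-layer FFF value, kept fixed here

loss_confident = nontargeted_fooling_loss(y_confident, gt_onehot, fff_term)
loss_fooled = nontargeted_fooling_loss(y_fooled, gt_onehot, fff_term)
```

The fooled output yields the lower loss value, so minimizing (2) with respect to the generator pushes the source model away from the ground truth label, while the second term rewards high first-layer activation energy.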

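For semantic segmentation, in (2) $y = S(x^{\mathrm{adv}}) \in \mathbb{I}^{H \times W \times M}$ and $\overline{y} \in \{0, 1\}^{H \times W \times M}$
holds, and the cross-entropy loss in (3) is changed to

$$J^{\mathrm{CE}}(y, \overline{y}) = -\sum_{i \in \mathcal{I}} \sum_{\mu \in \mathcal{M}} \overline{y}_{i,\mu} \log(y_{i,\mu}) = -\sum_{i \in \mathcal{I}} \log(y_{i,\overline{m}_i}), \tag{4}$$

with $y_{i,\mu}$, $\overline{y}_{i,\mu}$ being the prediction and ground truth for class $\mu$ at pixel i, respectively,
and $y_{i,\overline{m}_i}$ being the prediction at pixel i for the ground truth class $\overline{m}_i$ of pixel i.


3.4 Targeted Perturbations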


Different from the non-targeted case, the goal of a targeted attack is to let the DNN
output $S(x^{\mathrm{adv}})$ take on values $\mathring{y}$ defined by the attacker, which usually differ from the
ground truth (if the target label coincides with the ground truth of a particular image,
the attacked model will output that ground truth label). Hence, the attacker aims to decrease
the cross-entropy loss with respect to a target until the source model S predicts the
selected target class with high confidence. As before, we add the fast feature fool
loss in the first layer to boost the transferability of the targeted generated UAP.
For image classification, our generator fooling loss for targeted attacks is

$$J^{\mathrm{adv,targeted}} = \alpha \cdot J^{\mathrm{CE}}(S(x^{\mathrm{adv}}), \mathring{y}) + (1 - \alpha) \cdot J^{\mathrm{FFF}}_{1}(x^{\mathrm{adv}}), \tag{5}$$

where $\mathring{y} \in \{0, 1\}^{M}$ is the one-hot encoding of the target label $\mathring{m} \neq \overline{m}$. Note that the
sign of the cross entropy is flipped compared to the non-targeted case (2). Then,
similar to the non-targeted attack, the Adam optimizer is utilized.

If we consider semantic segmentation, the target $\mathring{y}$ in (5) becomes a one-hot
encoded semantic segmentation mask $\mathring{y} \in \{0, 1\}^{H \times W \times M}$.
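
The effect of the flipped sign in (5) can be illustrated with a deliberately simplified experiment. The sketch below is not the chapter's training procedure: instead of training a generator with Adam, it runs plain gradient descent directly on three hypothetical logits, and it holds the FFF term constant. All values and names are assumptions of this example.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def targeted_fooling_loss(y_adv, target_index, fff_first_layer, alpha=0.7):
    """(5): alpha * J^CE(S(x_adv), y_target) + (1 - alpha) * J^FFF_1(x_adv)."""
    return alpha * (-math.log(y_adv[target_index])) + (1.0 - alpha) * fff_first_layer

logits = [2.0, 0.5, -1.0]  # toy logits, class 0 is predicted initially
target = 2                 # class chosen by the attacker
alpha = 0.7
fff_term = 0.5             # held constant here; in (5) it depends on x_adv
losses = []

for _ in range(50):
    y = softmax(logits)
    losses.append(targeted_fooling_loss(y, target, fff_term, alpha))
    # Gradient of the cross-entropy w.r.t. the logits is y - one_hot(target).
    grad = [alpha * (y[k] - (1.0 if k == target else 0.0)) for k in range(len(logits))]
    logits = [z - 1.0 * g for z, g in zip(logits, grad)]

final_prediction = softmax(logits)
predicted_class = final_prediction.index(max(final_prediction))
```

Descending on the positively weighted cross-entropy drives the prediction towards the target class, whereas the non-targeted loss (2) uses the opposite sign to move the prediction away from the ground truth.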


4 Experiments on Image Classification 


In this section, we will first describe the dataset, network architectures, and evaluation 
metrics, which are used to measure the performance of generated UAPs on image 
classifiers. Afterward, we will analyze the effectiveness of the proposed fooling 
method on state-of-the-art image classifiers and will compare it to other state-of-the- 
art attack methods. 


4.1 Experimental Setup 


Dataset: We use the ImageNet dataset [RDS+15], which is a large collection of
human-annotated images. For all our experiments, a universal adversarial
perturbation is computed for a set of 10,000 images taken from the ImageNet training set $\mathcal{X}^{\mathrm{train}}$
(i.e., 10 images per class) and the results are reported on the ImageNet validation set
$\mathcal{X}^{\mathrm{val}}$ (50,000 images).


Network architectures: There are several design options regarding the architecture
choices for generator G and source model S. For our generator, we follow [ZPIE17]
and [PKGB18] and choose the ResNet generator from [JAFF16], which consists
of some convolution layers for downsampling, followed by some residual blocks,
before performing upsampling using transposed convolutions. As topology for the
source model S, we utilize the same set of pretrained image classifiers as for the target
model T, i.e., VGG-16, VGG-19 [SZ15], ResNet-18, ResNet-152 [HZRS16],
and also GoogleNet [SLJ+15].

Table 1 Fooling rates (%) of our proposed non-targeted UAPs in white-box attacks, for different
values of $\alpha$ and various target classifiers pretrained on ImageNet. Results are reported on a second
training set of 10,000 images. The adversarial perturbation is bounded by $L_{\infty}(r) \le \epsilon = 10$. The
highest fooling rates (%) are printed in boldface

            Source model S = target model T
$\alpha$    VGG-16   VGG-19   ResNet-18   ResNet-152   Avg
0           8.52     7.02
0.6         90.49    89.32
0.7         95.20    93.79    89.16       87.05        91.30
0.8         90.03    93.24    89.07       89.91        90.56
0.9         95.13    92.14    88.34       89.37        91.24
1           92.87    71.88    88.88       85.34        84.74

Table 2 Fooling rates (%) of our proposed non-targeted UAPs in white-box attacks, for various
target classifiers pretrained on ImageNet. Results are reported on the ImageNet validation set. We
report results for two $L_p$ norms, namely $L_2(r) \le \epsilon = 2000$ and $L_{\infty}(r) \le \epsilon = 10$

                                       Source model S = target model T
p           $\epsilon$   $\alpha$      VGG-16   VGG-19   ResNet-18   ResNet-152
2           2000         0.7           96.57    94.99    91.85       88.73
$\infty$    10           0.7           95.70    94.00    90.46       90.40


Evaluation metrics: We use the fooling rate as our metric to assess the performance
of our crafted UAPs on DNN image classifiers [MDFFF17, PKGB18, MUR18,
MGR19]. In the case of non-targeted attacks, it is the percentage of input images
for which $T(x^{\mathrm{adv}}) \neq T(x)$ holds. For targeted attacks, we calculate the top-1 target
accuracy, which can be understood as the percentage of adversarial examples that
are classified "correctly" as the target class, as desired by the attacker.
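
Both metrics reduce to simple counting over predicted labels. The sketch below is a minimal illustration; the label lists are hypothetical and the function names are chosen for this example.

```python
def fooling_rate(clean_preds, adv_preds):
    """Percentage of inputs for which T(x_adv) != T(x)."""
    changed = sum(c != a for c, a in zip(clean_preds, adv_preds))
    return 100.0 * changed / len(clean_preds)

def top1_target_accuracy(adv_preds, target):
    """Percentage of adversarial examples classified as the attacker's target class."""
    hits = sum(a == target for a in adv_preds)
    return 100.0 * hits / len(adv_preds)

# Hypothetical predictions of a target model T on five clean and adversarial images.
clean_preds = [3, 1, 4, 1, 5]
adv_preds = [3, 2, 7, 7, 7]

rate = fooling_rate(clean_preds, adv_preds)
target_acc = top1_target_accuracy(adv_preds, target=7)
```

Note that the fooling rate compares against the model's own clean prediction T(x), not against the ground truth, so it measures how often the perturbation changes the decision.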


4.2 Non-Targeted Universal Perturbations 


According to Fig. 2, we train our model with the non-targeted fooling loss (2). For
tuning the hyperparameter $\alpha$, the weight of our novel adversarial loss components,
we utilized a second training set of 10,000 random images (again 10 images per
class) taken from the ImageNet training set, which is disjoint from both the training
and the validation dataset. Table 1 shows that the best $\alpha$ for non-targeted attacks, on
average over all model topologies, is $\alpha = 0.7$.
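
The selection of $\alpha$ by average fooling rate can be retraced with a few lines of arithmetic. The sketch below uses only those rows of Table 1 for which all four fooling rates are available in this text (the rows for 0 and 0.6 are incomplete here and are left out); the variable names are assumptions of this example.

```python
# Fooling rates (%) from Table 1: VGG-16, VGG-19, ResNet-18, ResNet-152.
table1 = {
    0.7: [95.20, 93.79, 89.16, 87.05],
    0.8: [90.03, 93.24, 89.07, 89.91],
    0.9: [95.13, 92.14, 88.34, 89.37],
    1.0: [92.87, 71.88, 88.88, 85.34],
}

# Average over all four model topologies, then pick the alpha with the highest average.
averages = {a: sum(rates) / len(rates) for a, rates in table1.items()}
best_alpha = max(averages, key=averages.get)
```

Among these rows, the value 0.7 has the highest average, in line with the choice stated above, and the value 1 (cross-entropy only) has the lowest.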

For white-box attacks, where § = T holds, results on the ImageNet validation set 
for two different norms are given in Table 2. The maximum permissible L, norm of 
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Table 3 Fooling rates (%) of various non-targeted state-of-the-art methods on various target clas- 
sifiers trained on ImageNet (white-box attacks). The results of other state-of-the-art methods are 
reported from the respective paper. For comparison reasons, the average of our method leaves out 
the ResNet-18 model results. Highest fooling rates are printed in boldface 


Pp € Method |S=T Avg* 
VGG-16 | VGG-19 | ResNet-18 | ResNet-152 

oo *10 FFF 47.10 43.62 - 29.78 40.16 
CIs 71.59 72.84 - 60.72 68.38 
UAP 78.30 77.80 - 84.00 80.03 
GAP 83.70 80.10 - 82.70 82.16 
NAG 77.57 83.78 - 87.24 82.86 
TUAP- | 94.30 94.98 - 90.08 93.12 
RSI 
Ours 95.70 94.00 90.46 90.40 93.36 

2 2000 UAP 90.30 84.50 - 88.50 87.76 
GAP 93.90 94.90 - 79.50 89.43 
Ours 96.57 94.99 91.85 88.73 93.43 


(a) The UAP r (left) and the respective adversarial examples x**” = x + r (right), with L2 (r) < 2000. 


efile ct one pe cee 


(c) The UAP r (left) and the respective adversarial examples x*"” = x + r (right), with L..(r) < 10. 


Fig. 5 Examples of our non-targeted UAPs and adversarial examples. In a the universal adversarial 
perturbation is given on the left and eight different adversarial examples are shown on the right, 
where the L2 norm of the adversarial perturbation is bounded by e = 2000, i.e., La (r) < 2000. In b 
the respective original images are shown, whereas in c the Loo norm of the adversarial perturbation 
is bounded by e = 10, i.e., Logo(r) < 10, a = 0.7. In these experiments, the source model S is 
ResNet-18 [HZRS16]. The pixel values in UAPs are scaled for better visibility 


the perturbations for p = 2 and p = ∞ is set to ε = 2000 and ε = 10, respectively,
following [MDFFF17]. As Moosavi-Dezfooli et al. [MDFFF17] pointed out,
these values are selected to obtain a perturbation whose norm is considerably smaller
than the average image norm in the ImageNet dataset, which yields quasi-imperceptible
adversarial examples. The results in Table 2 show that the proposed method is successful
in the white-box setting. For the L_∞ norm, all reported fooling rates
are above 90%. To show that our adversarial examples are quasi-imperceptible to
humans, we illustrate some adversarial examples with their respective scaled UAPs
in Fig. 5.


Table 4 Transferability of our proposed non-targeted UAPs (white-box and black-box attacks).
Results are reported in the form of fooling rates (%) for various combinations of source models
S and target models T, pretrained on ImageNet. The generator is trained to fool the source model
(rows), and it is tested on the target model (columns). The adversarial perturbation is bounded by
L_∞(r) < ε = 10, α = 0.7. *The average is computed excluding the easier white-box attacks (main
diagonal)

Source model S | Target model T                              | Avg*
               | VGG-16 | VGG-19 | ResNet-18 | ResNet-152 |
VGG-16         | 95.70  | 86.67  | 49.98     | 36.34      | 57.66
VGG-19         | 84.77  | 94.00  | 47.24     | 36.46      | 56.15
ResNet-18      | 76.49  | 72.18  | 90.46     | 50.46      | 66.37
ResNet-152     | 86.19  | 82.36  | 76.04     | 90.40      | 81.53
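
The two norm bounds can be made concrete with a short sketch. The following Python snippet is an illustration only: it assumes a perturbation is forced into the permitted L_p ball by clipping (p = ∞) or rescaling (p = 2). The helper name `project_to_lp_ball` and the random perturbation are hypothetical, and the generator used in this chapter may enforce the bound in a different way.

```python
import numpy as np

def project_to_lp_ball(r, eps, p):
    """Return a copy of perturbation r whose L_p norm is at most eps (p = 2 or np.inf)."""
    r = np.asarray(r, dtype=np.float64)
    if p == np.inf:
        # L_inf bound: clip every pixel value to the interval [-eps, eps].
        return np.clip(r, -eps, eps)
    if p == 2:
        # L_2 bound: rescale the whole perturbation if its Euclidean norm exceeds eps.
        norm = np.linalg.norm(r.ravel())
        return r if norm <= eps else r * (eps / norm)
    raise ValueError("only p = 2 and p = inf are handled in this sketch")

rng = np.random.default_rng(0)
r = rng.normal(scale=50.0, size=(224, 224, 3))   # hypothetical raw perturbation
r_inf = project_to_lp_ball(r, eps=10.0, p=np.inf)
r_l2 = project_to_lp_ball(r, eps=2000.0, p=2)
print(np.abs(r_inf).max(), np.linalg.norm(r_l2.ravel()))
```

With pixel values in [0, 255], an L_∞ bound of 10 changes each pixel by at most about 4% of the value range, which is consistent with the quasi-imperceptibility argument above.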

In Table 3, we compare our proposed approach in generating non-targeted UAPs
with state-of-the-art methods, i.e., fast feature fool (FFF, as originally proposed, on
all layers) [MGR19], class impressions (CIs) [MUR18], universal adversarial perturbation
(UAP) [MDFFF17], generative adversarial perturbation (GAP) [PKGB18],
network for adversary generation (NAG) [RMOGVB18], and targeted UAP-random
source image (TUAP-RSI) [ZBIK20b]. In these experiments, we again consider the
white-box setting, i.e., S = T. Our proposed approach achieves a new state-of-the-art
performance on almost all models for both L_p norms, being on average 4% absolute
better in fooling rate with the L_2 norm, and about 0.25% absolute better with the
L_∞ norm (93.36% vs. 93.12% on average in Table 3).
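
As a quick check of the averages quoted above, the following Python sketch recomputes the Avg* column of Table 3 for the two strongest methods per norm. The numbers are copied from Table 3; the dictionary layout and the helper `mean` are illustrative choices, not part of the original evaluation code.

```python
# Fooling rates (%) copied from Table 3.
# Column order: VGG-16, VGG-19, ResNet-152 (ResNet-18 is left out of Avg*).
table3_linf = {
    "TUAP-RSI": [94.30, 94.98, 90.08],
    "Ours": [95.70, 94.00, 90.40],
}
table3_l2 = {
    "GAP": [93.90, 94.90, 79.50],
    "Ours": [96.57, 94.99, 88.73],
}

def mean(values):
    return sum(values) / len(values)

# Absolute margin of our method over the best competing method per norm.
margin_linf = mean(table3_linf["Ours"]) - mean(table3_linf["TUAP-RSI"])
margin_l2 = mean(table3_l2["Ours"]) - mean(table3_l2["GAP"])
print(f"L_inf margin: {margin_linf:.2f}, L_2 margin: {margin_l2:.2f}")
```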

In Table 4 we investigate the white-box (S = T) and black-box (S ≠ T) capability
of our proposed method through various combinations of source and target models.
Note that the rightmost column represents an average over the fooling rates in the
black-box settings and thus indicates the transferability of our proposed method.
Overall, we achieve fooling rates of over 90% in the white-box setting and of over 55%
on average in the black-box setting. We also compare the transferability of our produced UAPs
with the same state-of-the-art methods as before. The results for these experiments
are shown in Table 5, where VGG-16, ResNet-152, and GoogleNet are used as
the source model in Table 5a-c, respectively. It turns out to be advisable to choose a
deep network as the source model (ResNet-152), since then our performance on
the unseen VGG-16 and VGG-19 target models is at least about 12% absolute better than
the earlier state of the art (L_∞ norm).
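
The Avg* column of Table 4 can be reproduced from the matrix of fooling rates by averaging each row without its diagonal entry. The snippet below is a minimal sketch with the values copied from Table 4; variable names are illustrative.

```python
import numpy as np

models = ["VGG-16", "VGG-19", "ResNet-18", "ResNet-152"]
# Rows: source model S, columns: target model T (fooling rates in %, Table 4).
fooling = np.array([
    [95.70, 86.67, 49.98, 36.34],
    [84.77, 94.00, 47.24, 36.46],
    [76.49, 72.18, 90.46, 50.46],
    [86.19, 82.36, 76.04, 90.40],
])

off_diagonal = ~np.eye(len(models), dtype=bool)   # True for black-box pairs (S != T)
black_box_avg = np.array(
    [row[mask].mean() for row, mask in zip(fooling, off_diagonal)]
)
white_box = np.diag(fooling)                      # S = T entries

for name, wb, bb in zip(models, white_box, black_box_avg):
    print(f"S = {name:<10} white-box {wb:.2f}  black-box avg {bb:.2f}")
```

The row for ResNet-152 has the highest black-box average, which is the basis for recommending a deep source model.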

For investigating the generalization power of UAPs w.r.t. unseen data, we evaluate
the influence of the size of the training dataset X^train on the quality of UAPs in
conducting white-box and black-box attacks. Figure 6 shows the fooling rates obtained
for VGG-16 as the target model T, on the ImageNet validation set for different sizes
of X^train. The results show that by using a dataset X^train containing only 1000 images,
our approach leads to a fooling rate of more than 60% on the ImageNet validation
dataset in both the white-box and black-box settings. Additionally, the number of
images in X^train turns out to be more vital for the fooling rate of black-box attacks as
compared to white-box attacks.


Table 5 Transferability of our proposed non-targeted UAPs compared to other methods, i.e., FFF
[MGR19], CIs [MUR18], UAP [MDFFF17], GAP [PKGB18], and NAG [RMOGVB18], using
different source models S and target models T. The UAP is bounded by L_∞(r) < ε = 10. Values
of our method are taken from Table 4 (except GoogleNet). *Note that the results are reported
from the respective paper. Highest fooling rates are printed in boldface

(a) S: VGG-16 [SZ15]

T          | Method | Fooling rate (%)
VGG-19     | FFF*   | 41.98
           | CIs*   | 65.64
           | UAP*   | 73.10
           | GAP    | 79.14
           | NAG*   | 73.25
           | Ours   | 86.67
ResNet-152 | FFF*   | 27.82
           | CIs*   | 45.33
           | UAP*   | 63.40
           | GAP    | 30.32
           | NAG*   | 54.38
           | Ours   | 36.34
ResNet-18  | Ours   | 49.98

(b) S: ResNet-152 [HZRS16]

T          | Method | Fooling rate (%)
VGG-16     | FFF*   | 19.23
           | CIs*   | 47.21
           | UAP*   | 47.00
           | GAP    | 70.45
           | NAG*   | 52.17
           | Ours   | 86.19
VGG-19     | FFF*   | 17.15
           | CIs*   | 48.78
           | UAP*   | 45.50
           | GAP    | 70.38
           | NAG*   | 53.18
           | Ours   | 82.36
ResNet-18  | Ours   | 76.04

(c) S: GoogleNet [SLJ+15]

T          | Method | Fooling rate (%)
VGG-16     | FFF*   | 40.91
           | CIs*   | 59.12
           | UAP*   | 39.20
           | GAP    | 71.14
           | NAG*   | 56.40
           | Ours   | 75.35
ResNet-152 | FFF*   | 25.31
           | CIs*   | 47.81
           | UAP*   | 45.50
           | GAP    | 51.72
           | NAG*   | 59.22
           | Ours   | 61.24
ResNet-18  | Ours   | 67.30


Fig. 6 Fooling rates (%) of our non-targeted UAPs on the ImageNet validation set for different sizes
of X^train in both white-box and black-box settings. In the white-box setting, the source model S and
the target model T are VGG-16, while in the black-box setting, the source model S is ResNet-152
and the target model T is again VGG-16. Results are reported for L_∞(r) < ε = 10 and α = 0.7


Fig. 7 Fooling rates (%) of our non-targeted UAPs on the ImageNet validation set, in both white-box
and black-box attacks for L_∞ and L_2, for different layers ℓ in (1) applied in the loss function
(2). In the white-box setting, the source model S and the target model T are ResNet-18. In the
black-box setting, the source model S is again ResNet-18 and the target model T is VGG-16.
Results are reported for α = 0.7

To examine the impact of the layer which is used in our loss function, we utilized
different layers in (1), then applied them in the loss function (2) to train the
generator that produces the UAPs. In practice, we are interested in the impact of the layer
position. Figure 7 shows the fooling rate in white-box and black-box settings
for L_∞ and L_2, when different layers from ℓ = 1 to ℓ = 6 are applied in (1) and
(2). This figure shows that choosing deeper layers leads to a decreasing trend in the
fooling rate. Also, this trend is stronger in black-box attacks, where a drop of more than 20%
in the attack success rate can be observed. This indicates that it is advisable to
choose earlier layers, in particular the first layer, in generating UAPs to obtain both
an optimal fooling rate and transferability.


4.3 Targeted Universal Perturbations 


In this section, we apply the targeted fooling loss (5), again with α = 0.7, for training
the generator in Fig. 2. We assume the chosen α is appropriate for targeted attacks as
well and thus dispense with further investigations. If results are good, then this supports
some robustness w.r.t. the choice of α. Figure 8 depicts two examples of our targeted
UAPs, some original images and respective adversarial examples. In these experiments,
the fooling rates (top-1 target accuracy) on the validation set for the target
classes m = 919 (street sign) and m = 920 (traffic light, traffic
signal, stoplight) are 63.2% and 57.83%, respectively, which underlines the
effectiveness of our approach.

For assessing the generalization power of our proposed method across different
target classes and for comparison with GAP [PKGB18], we followed Poursaeed et al. and
used 10 randomly sampled classes. The resulting average top-1 target accuracy, when
the adversarial perturbation is bounded by L_∞(r) < ε = 10, is 66.57%, which is
significantly higher than the 52.0% reported for GAP [PKGB18].
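
The two success measures used in this section can be sketched as follows. This is a minimal illustration that assumes the common convention in which the non-targeted fooling rate counts label changes and the targeted fooling rate is the top-1 target accuracy; the function names and the five example predictions are hypothetical.

```python
import numpy as np

def fooling_rate(pred_clean, pred_adv):
    """Non-targeted: share of images (%) whose predicted label changes under the UAP."""
    pred_clean, pred_adv = np.asarray(pred_clean), np.asarray(pred_adv)
    return 100.0 * np.mean(pred_clean != pred_adv)

def target_accuracy(pred_adv, target_class):
    """Targeted: share of adversarial examples (%) classified as the target class."""
    return 100.0 * np.mean(np.asarray(pred_adv) == target_class)

# Hypothetical top-1 predictions for five images (ImageNet class indices).
pred_clean = [1, 207, 919, 340, 920]
pred_adv = [919, 919, 919, 340, 919]
print(fooling_rate(pred_clean, pred_adv))   # 3 of 5 labels changed
print(target_accuracy(pred_adv, 919))       # 4 of 5 predictions hit the target
```

Note that the two measures differ for images that already belong to the target class: they count toward the target accuracy but not toward the label-change rate.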


(a) The UAP r for target label traffic light, signal, stoplight (left) and the respective adversarial
examples x^adv (right).


(c) The UAP r for target label street sign (left) and the respective adversarial examples x^adv (right).


Fig. 8 Examples of our targeted UAPs and adversarial examples. In these experiments, the source
model S is VGG-16 [SZ15], with L_∞(r) < ε = 10, α = 0.7. The pixel values in UAPs are scaled
for better visibility


Fig. 9 Fooling rates (%) of our targeted UAPs on the ImageNet validation set for different sizes
of X^train in both white-box and black-box attacks. In the white-box setting, the source model
S and the target model T are VGG-16, while in the black-box setting, the source model S is
again VGG-16 and the target model T is VGG-19. Results are reported for L_∞(r) < ε = 10,
α = 0.7, and target class m = 847 (tank, army tank, armored combat vehicle,
armoured combat vehicle)


To demonstrate the generalization power of our targeted UAPs w.r.t. unseen data,
we visualize the attack success rate obtained by the source model VGG-16, in white-box
and black-box settings, on the ImageNet validation dataset for different sizes
of X^train in Fig. 9. For instance, with X^train containing 10,000 images, we are able
to fool the target model on over 20% of the images in the ImageNet validation set.
It should be noted that training the generator G to produce a single UAP forcing
the target model to output a specific target class m is an extremely challenging task.
However, we particularly observe that utilizing 10,000 training images again seems
to be sufficient for a white-box attack.


5 Experiments on Semantic Segmentation 


We continue our investigations by applying our method to the task of semantic 
segmentation to show its cross-task applicability. We start with the experimental 
setup, followed by an evaluation of non-targeted and targeted attacks. 


5.1 Experimental Setup 


Dataset: We conduct our experiments on the widely known Cityscapes dataset
[COR+16]. It contains pixel-level annotations of 5,000 high-resolution images (2,975
training, 500 validation, and 1,525 test images) captured in urban street scenes.
The original images have a resolution of 2048 × 1024 pixels, which we downsample
to 1024 × 512 for our experiments [MCKBF17, PKGB18].


Table 6 The mean intersection-over-union (mIoU) (%) of non-targeted UAP methods on the semantic
segmentation model FCN-8 pretrained on Cityscapes. Our method is compared with GAP
[PKGB18]. In these experiments, the source model S and the target model T are either the
same, i.e., S = T = FCN-8, or different, i.e., S = ERFNet, T = FCN-8. Parameters are set to
L_∞(r) < ε, α = 0.7. Best results are printed in boldface

(a) S = T = FCN-8

Method | ε = 2 | ε = 5 | ε = 10 | ε = 20
GAP    | 18.1  | 12.8  | 4.0    | 2.1
Ours   | 16.2  | 9.8   | 2.0    | 0.3

(b) S = ERFNet, T = FCN-8

Method | ε = 2 | ε = 5 | ε = 10 | ε = 20
GAP    | 27.3  | 16.4  | 7.0    | 4.3
Ours   | 26.4  | 14.8  | 4.1    | 1.5


Network architecture: Regarding the architecture of the source model S, we use 
FCN-8 [LSD15] for white-box attacks, and ERFNet [RABA18] for the black-box 
setting. FCN-8 consists of an encoder part which transforms an input image into a 
low-resolution semantic representation and a decoder part which recovers the high 
spatial resolution of the image by fusing different levels of feature representations 
together. ERFNet also consists of an encoder-decoder structure, but without any 
bypass connections between the encoder and the decoder. Additionally, residual 
units are used with factorized convolutions to obtain a more efficient computation. 

In our experiments, we consider FCN-8 [LSD15] as our segmentation target
model T, and use the L_∞ norm to be comparable with [MCKBF17, PKGB18].


Evaluation metrics: To assess the performance of a semantic segmentation network,
we use the mean intersection-over-union (mIoU) [COR+16]. It is defined as

$$
\mathrm{mIoU} = \frac{1}{|M|} \sum_{\mu \in M} \frac{\mathrm{TP}_\mu}{\mathrm{TP}_\mu + \mathrm{FP}_\mu + \mathrm{FN}_\mu}, \tag{6}
$$

with class μ ∈ M, class-specific true positives TP_μ, false positives FP_μ, and false
negatives FN_μ. For assessing the impact of non-targeted adversarial attacks on semantic
segmentation, we compute the mIoU on adversarial examples [AMT18].
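
The definition in (6) can be sketched in a few lines of Python. The function below is an illustration, not the evaluation script used for the reported numbers: it builds a confusion matrix from integer label maps and averages the per-class ratio. Skipping classes that occur in neither map is an assumption made here to avoid a division by zero, and ignore labels as used in Cityscapes are not handled.

```python
import numpy as np

def mean_iou(pred, label, num_classes):
    """mIoU (%) from integer label maps of equal shape, following Eq. (6)."""
    pred, label = np.asarray(pred).ravel(), np.asarray(label).ravel()
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    conf = np.bincount(label * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    # Classes absent from both maps are skipped (assumption of this sketch).
    iou = tp[denom > 0] / denom[denom > 0]
    return 100.0 * iou.mean()

label = np.array([[0, 0, 1], [1, 2, 2]])   # toy ground truth with three classes
pred = np.array([[0, 1, 1], [1, 2, 0]])    # toy prediction
print(mean_iou(pred, label, num_classes=3))
```

For a non-targeted attack, `pred` would be the target model's output on the adversarial example, so a lower mIoU indicates a stronger attack.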




To measure the attack success of our targeted attack, we compute the pixel
accuracy (PA) between the prediction T(x^adv) and the target mask m [PKGB18].
In this chapter, we restrict our analysis to the static target segmentation scenario
[MCKBF17] and use the same target mask as in [PKGB18, MCKBF17] (see also
Fig. 11).
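
Pixel accuracy is simply the share of pixels on which the adversarial prediction agrees with the target mask. A minimal sketch, with hypothetical toy masks standing in for the Cityscapes label maps:

```python
import numpy as np

def pixel_accuracy(pred_mask, target_mask):
    """Share of pixels (%) where the predicted class equals the target class."""
    pred_mask, target_mask = np.asarray(pred_mask), np.asarray(target_mask)
    if pred_mask.shape != target_mask.shape:
        raise ValueError("prediction and target mask must have the same shape")
    return 100.0 * np.mean(pred_mask == target_mask)

target = np.array([[0, 0, 1, 1], [2, 2, 1, 1]])     # hypothetical static target mask
pred_adv = np.array([[0, 0, 1, 1], [2, 0, 1, 2]])   # hypothetical prediction on x^adv
print(pixel_accuracy(pred_adv, target))             # 6 of 8 pixels match the target
```

In contrast to the mIoU used for non-targeted attacks, a higher value here indicates a more successful attack.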


5.2 Non-targeted Universal Perturbations 


For the non-targeted case, we train our model with the non-targeted adversarial loss
function (2), where we use (4) as the cross-entropy loss. The maximum permissible
L_p norm of the perturbations for p = ∞ is set to ε ∈ {2, 5, 10, 20}. We report results
for both our method and GAP [PKGB18] on the Cityscapes validation set in Table 6a.
It can be observed that across different values for ε our method is superior to GAP
in terms of decreasing the mIoU of the underlying target model.

We visualize the effect of the generated non-targeted UAPs in Fig. 10 by illustrating
the original image, the UAP, the resulting adversarial example, the prediction
on the original image, the ground truth mask, and the resulting segmentation output.
Here, the maximum permissible L_p norm of the perturbations for p = ∞ is set to ε = 5.

For investigating the transferability of our generated UAPs, we use a UAP optimized
on ERFNet as the source model S and test it on the target model T being
FCN-8. Table 6b reports black-box attack results for our method compared to the
attack method GAP [PKGB18]. Our non-targeted UAPs decrease the mIoU of the
FCN-8 on the Cityscapes dataset more than GAP [PKGB18] does, for all considered
perturbation bounds (ε ∈ {2, 5, 10, 20}). These results illustrate the
effectiveness of the generated perturbation.


5.3 Targeted Universal Perturbations 


In the targeted case, we aim at finding a UAP which forces the segmentation source
model S and the segmentation target model T to predict a specific target mask
m. We now train our model with the targeted adversarial loss function (5), again using
(4) as the cross-entropy loss. We apply the same target mask m as
in [PKGB18], see Fig. 11e.

Figure 11 depicts an original image, our generated targeted UAP, the respective
adversarial example, the prediction on the original image, the target segmentation
mask, and the prediction on the adversarial example. The results show that the generated
UAP resembles the target mask and is able to fool the FCN-8 in a way that it
now outputs the target segmentation mask.


(a) Original image x (b) UAP r (c) Adversarial example x^adv
(d) Prediction for original image (e) Ground truth mask (f) Prediction for adversarial example


Fig. 10 An example of our non-targeted UAPs optimized for the task of semantic segmentation on
the Cityscapes dataset. Results are displayed on the Cityscapes validation set. In these experiments,
both the source model S and the target model T are FCN-8, with L_∞(r) < ε = 5, α = 0.7. The
pixel values in the UAP are scaled for better visibility


(a) Original image x (b) UAP r (c) Adversarial example x^adv
(d) Prediction for original image (e) Target mask m (f) Prediction for adversarial example


Fig. 11 An example of our targeted UAPs optimized for the task of semantic segmentation on the
Cityscapes dataset. Results are displayed on the Cityscapes validation set. In these experiments,
both the source model S and the target model T are FCN-8, with L_∞(r) < ε = 10, α = 0.7. The
pixel values in the UAP are scaled for better visibility


We compare our targeted attack with two state-of-the-art methods in Table 7.
While our method performs on par with the state of the art for weak attacks (ε = 2),
for medium to strong attacks (ε ≥ 5) we outperform both GAP [PKGB18] and UAP-Seg
[MCKBF17].




Table 7 Pixel accuracy (%) of targeted UAP methods (white-box attack) on the semantic segmentation
model FCN-8 pretrained on the Cityscapes training set. Results are reported on the Cityscapes
validation set. Our method is compared to UAP-Seg [MCKBF17] and GAP [PKGB18]. In these
experiments, both the source model S and the target model T are FCN-8, with L_∞(r) < ε, α = 0.7.
Best results are printed in boldface

Method  | ε = 2 | ε = 5 | ε = 10 | ε = 20
GAP     | 61.2  | 79.5  | 92.1   | 97.2
UAP-Seg | 60.9  | 80.3  | 91.0   | 96.3
Ours    | 61.0  | 81.8  | 93.1   | 97.4


6 Conclusions 


We presented a novel method to effectively generate targeted and non-targeted universal
adversarial perturbations (UAPs) in both white-box and black-box settings.
Our proposed method shows new state-of-the-art fooling rates for targeted as well
as non-targeted UAPs on different classifiers. Additionally, our non-targeted UAPs
show a significantly higher transferability across models when compared to other
methods, given that we generated our UAPs on the deepest network in the investigation.
This is achieved by incorporating an additional loss term during training,
which aims at increasing the activation of the first layer. Finally, we extended our
method to the task of semantic segmentation to demonstrate its applicability also in more
complex environment perception tasks. Due to its state-of-the-art effectiveness for
object classification and semantic segmentation, we strongly recommend employing
the proposed types of attacks in the validation of automated vehicles' environment
perception.


Robin Rombach, Patrick Esser, Andreas Blattmann, and Björn Ommer


Abstract To tackle increasingly complex tasks, it has become an essential ability of 
neural networks to learn abstract representations. These task-specific representations 
and, particularly, the invariances they capture turn neural networks into black-box 
models that lack interpretability. To open such a black box, it is, therefore, crucial 
to uncover the different semantic concepts a model has learned as well as those that 
it has learned to be invariant to. We present an approach based on invertible neural 
networks (INNs) that (i) recovers the task-specific, learned invariances by disentan- 
gling the remaining factor of variation in the data and that (ii) invertibly transforms 
these recovered invariances combined with the model representation into an equally 
expressive one with accessible semantic concepts. As a consequence, neural net- 
work representations become understandable by providing the means to (i) expose 
their semantic meaning, (ii) semantically modify a representation, and (iii) visualize 
individual learned semantic concepts and invariances. Our invertible approach sig- 
nificantly extends the abilities to understand black-box models by enabling post hoc 
interpretations of state-of-the-art networks without compromising their performance. 
Our implementation is available at https://compvis.github.io/invariances/. 


1 Introduction 


Key to the wide success of deep neural networks is end-to-end learning of powerful
hidden representations that aim to (i) capture all task-relevant characteristics while
(ii) being invariant to all other variabilities in the data [LeC12, AS18]. Deep learning
can yield abstract representations that are perfectly adapted feature encodings for
the task at hand. However, their increasing abstraction capability and performance
come at the expense of interpretability [BBM+15]: although the network
may solve a problem, it does not convey an understanding of its predictions or their 
causes, often leaving the impression of a black box [Mil19]. In particular, users are 
missing an explanation of semantic concepts that the model has learned to represent 
and of those it has learned to ignore, i.e., its invariances. 

Providing such explanations and an understanding of network predictions and 
their causes is thus crucial for transparent AI. Not only is this relevant to discover 
limitations and promising directions for future improvements of the AI system itself, 
but also for compliance with legislation [GF17, Eur20], knowledge distillation from 
such a system [Lip18], and post hoc verification of the model [SWM17]. Conse- 
quently, research on interpretable deep models has recently gained a lot of attention, 
particularly methods that investigate latent representations to understand what the 
model has learned [SWM17, SZS+14, BZK+17, FV18, ERO20]. 


Challenges and aims: Assessing these latent representations is challenging due to
two fundamental issues: (i) To achieve robustness and generalization despite noisy
inputs and data variability, hidden layers exhibit a distributed coding of semantically
meaningful concepts [FV18]. Attributing semantics to a single neuron via backpropagation
[MLB+17] or synthesis [YCN+15] is thus impossible without altering the
network [MSM18, ZKL+16], which typically degrades performance. (ii) End-to-end
learning trains deep representations toward a goal task, making them invariant to features
irrelevant for this goal. Understanding these characteristics that a representation
has abstracted away is challenging, since we essentially need to portray features that
have been discarded.

These challenges call for a method that can interpret existing network representa- 
tions by recovering their invariances without modifying them. Given these recovered 
invariances, we seek an invertible mapping that translates a representation and the 
invariances onto understandable semantic concepts. The mapping disentangles the 
distributed encoding of the high-dimensional representation and its invariances by 
projecting them onto separate multi-dimensional factors that correspond to human- 
understandable semantic concepts. Both this translation and the recovery of invariances
are implemented with invertible neural networks (INNs) [Red93, DSB17, KD18].
For the translation, this guarantees that the resulting understandable representation
is equally expressive as the model representation combined with the
recovered invariances (no information is lost). Its invertibility also warrants that fea- 
ture modifications applied in the semantic domain correctly adjust the recovered 
representation. 


Our contributions: Our contributions to a comprehensive understanding of deep representations
are as follows: (i) We present an approach, which, by utilizing invertible
neural networks, improves the understanding of representations produced by existing
network architectures with no need for re-training or otherwise compromising
their performance. (ii) Our generative approach is able to recover the invariances that
result from the non-injective projection (of input onto a latent representation) which
deep networks typically learn. This model then provides a probabilistic visualization
of the latent representation and its invariances. (iii) We bijectively translate an arbitrarily
abstract representation and its invariances via a non-linear transformation into
another representation of equal expressiveness, but with accessible semantic concepts.
(iv) The invertibility also enables manipulation of the original latent representations
in a semantically understandable manner, thus facilitating further diagnostics
of a network.


2 Background 


Two main approaches to interpretable AI can be identified: those which aim to incorporate
interpretability directly into the design of models, and those which aim to
provide interpretability to existing models [MSM18]. Approaches from the first category
range from modifications of network architectures [ZKL+16], over regularization
of models encouraging interpretability [LBMO19, PASC+20], toward combinations
of both [ZNWZ18]. However, these approaches always involve a trade-off
between model performance and model interpretability. Being of the latter category,
our approach allows interpreting representations of existing models without compromising
their performance.

To better understand what an existing model has learned, its representations must 
be studied [SWM17]. Szegedy et al. [SZS+14] show that both random directions
and coordinate axes in the feature space of networks can represent semantic prop- 
erties and conclude that they are not necessarily represented by individual neurons. 
Different works attempt to select groups of neurons which have a certain seman- 
tic meaning, such as based on scenes [ZKL+15], objects [SR15] and object parts 
[SRD14]. [BZK+17] studied the interpretability of neurons and found that a rota- 
tion of the representation space spanned by the neurons decreases its interpretability. 
While this suggests that the neurons provide a more interpretable basis compared 
to a random basis, [FV18] shows that the choice of basis is not the only challenge 
for interpretability of representations. Their findings demonstrate that learned repre- 
sentations are distributed, i.e., a single semantic concept is encoded by an activation 
pattern involving multiple neurons, and a single neuron is involved in the encoding 
of multiple different semantic concepts. Instead of selecting a set of neurons directly, 
[ERO20] learns an INN that transforms the original representation space to an inter- 
pretable space, where a single semantic concept is represented by a known group of 
neurons and a single neuron is involved in the encoding of just a single semantic con- 
cept. However, to interpret not only the representation itself but also its invariances, 
it is insufficient to transform only the representation itself. Our approach therefore 
transforms the latent representation space of an autoencoder, which has the capacity 
to represent its inputs faithfully, and subsequently translates a model representation 
and its invariances into this space for semantic interpretation and visualization. 

A large body of work approaches interpretability of existing networks based
on visualizations. Selvaraju et al. [SCD+20] use gradients of network outputs
with respect to a convolutional layer to obtain coarse localization maps. Bach et
al. [BBM+15] propose an approach to obtain pixel-wise relevance scores for a spe- 
cific class of models which is generalized in [MLB+17]. To obtain richer visual 
interpretations, [ZF14, SVZ14, YCN+15, MV16] reconstruct images which maxi- 
mally activate certain neurons. Nguyen et al. [NDY+16] use a generator network for 
this task, which was introduced in [DB16] for reconstructing images from their 
feature representation. Our key insight is that these existing approaches do not 
explicitly account for the invariances learned by a model. Invariances imply that 
feature inversion is a one-to-many mapping and thus they must be recovered to solve 
the task. Recently, [SGM+20] introduced a GAN-based approach that utilizes fea- 
tures of a pretrained classifier as a semantic pyramid for image generation. Nash 
et al. [NKW19] used samples from an autoregressive model of images conditioned 
on a feature representation to gain insights into the representation’s invariances. In 
contrast, our approach recovers an explicit representation of the invariances, which 
can be recombined with modified feature representations, and thus makes the effect 
of modifications to representations, e.g., through adversarial attacks, visible. 

Other works consider visual interpretations for specialized models. Santurkar et 
al. [SIT+19] showed that the quality of images which maximally activate certain 
neurons is significantly improved when activating neurons of an adversarially robust 
classifier. Bau et al. [BZS+19] explore the relationship between neurons and the 
images produced by a generative adversarial network. For the same class of models, 
[GAOI19] finds directions in their input space which represent semantic concepts 
corresponding to certain cognitive properties. Such semantic directions have previously
also been found in classifier networks [UGP+17], but this requires aligned data.
All of these approaches either require special training of models, are limited to a
very special class of models which already provide visualizations, or depend on
special assumptions on model and data. In contrast, our approach can be applied to
arbitrary models without re-training or modifying them, and provides both visualiza- 
tions and semantic explanations, for both the model’s representation and its learned 
invariances. 


3 Method 


Common tasks of computer vision can be phrased as a mapping from an input image 
x to some output f(x) such as a classification of the image, a regression (e.g., 
of object locations), a (semantic) segmentation map, or a re-synthesis that yields 
another image. Deep learning utilizes a hierarchy of intermediate network layers 
that gradually transform the input into increasingly more abstract representations. 
Let z = Φ(x) ∈ R^{N_z} be the representation extracted by one such layer (without loss
of generality we consider z to be an N_z-dim vector, flattening it if necessary) and
f(x) = Ψ(z) = Ψ(Φ(x)) the mapping onto the output.

An essential characteristic of a deep feature encoding z is the increasing abstractness
of higher feature encoding layers and the resulting reduction of information.


Fig. 1 Proposed architecture. We provide post hoc interpretation for a given deep network f =
Ψ ∘ Φ. For a deep representation z = Φ(x), a conditional INN t recovers Φ's invariances v from
a representation z̄ which contains entangled information about both z and v. The INN e then
translates the representation z̄ into a factorized representation with accessible semantic concepts.
This approach allows for various applications, including visualizations of network representations of
natural (green box) and altered inputs (blue box), semantic network analysis (red box) and semantic
image modifications (yellow box)


This reduction generally causes the feature encoding to become invariant to those
properties of the input image that do not provide salient information for the task at
hand [CWG+18]. To explain a latent representation, we need to recover such invariances
v and make z and v interpretable by learning a bijective mapping onto understandable
semantic concepts, see Fig. 1. Section 3.1 describes our INN t to recover
an encoding v of the invariances. Due to the generative nature of t, our approach
can correctly sample visualizations of the model representation and its invariances
without leaving the underlying data distribution and introducing artifacts. With v
then available, Sect. 3.2 presents an INN e that translates t's encoding of z and
v without losing information onto disentangled semantic concepts. Moreover, the
invertibility allows modifications in the semantic domain to correctly project back
onto the original representation or into image space.


3.1 Recovering the Invariances of Deep Models 


Learning an encoding to help recover invariances: Key to a deep representation is
not only the information z captures, but also what it has learned to abstract away. To
learn what z misses with respect to x, we need an encoding z̄, which, in contrast to
z, includes the invariances exhibited by z. Without making prior assumptions about
the deep model f, autoencoders provide a generic way to obtain such an encoding z̄,
since they ensure that their input x can be recovered from their learned representation
z̄, which hence also comprises the invariances.

Therefore, we learn an autoencoder with an encoder E that provides the data
representation z̄ = E(x) and a decoder D producing the data reconstruction x̄ =
D(z̄). Section 3.2 will utilize the decoding from z̄ to x̄ to visualize both z and
v. The autoencoder is trained to reconstruct its inputs by minimizing a perceptual
metric between input and reconstruction, ‖x − x̄‖, as in [DB16]. The details of the
architecture and training procedure can be found in Sect. 3.3, autoencoder E, D. It
is crucial that the autoencoder only needs to be trained once on the training data.
Consequently, the same E can be used to interpret different representations z, e.g.,
different models or layers within a model, thus ensuring fair comparisons between
them. Moreover, the complexity of the autoencoder can be adjusted based on the
computational needs, allowing us to work with much lower dimensional encodings z̄
compared to reconstructing the invariances directly from the images x. This reduces
the computational demands of our approach significantly.


Learning a conditional INN that recovers invariances: Due to the reconstruction
task of the autoencoder, $\bar{z}$ not only contains the invariances v, but also the
representation z. Thus, we must disentangle [EHO19, LSOL20, KSLO19] v and z using a
mapping $t(\cdot|z) : \bar{z} \mapsto v = t(\bar{z}|z)$ which, depending on z, extracts v
from $\bar{z}$.

Besides extracting the invariances from a given $\bar{z}$, t must also enable an inverse
mapping from given model representations z to $\bar{z}$ to support a further mapping
onto semantic concepts (Sect. 3.2) and visualization based on $D(\bar{z})$. There are many
different x with $\Phi(x) = z$, namely, all those x which differ only in properties that
$\Phi$ is invariant to. Thus, there are also many different $\bar{z}$ that this mapping must
recover. Consequently, the mapping from z to $\bar{z}$ is set-valued. However, to
understand f we do not want to recover all possible $\bar{z}$, but only those which are
likely under the training distribution of the autoencoder. In particular, this excludes
unnatural images such as those obtained by DeepDream [MOT15] or adversarial attacks
[SZS+14]. In conclusion, we need to sample $\bar{z} \sim p(\bar{z}|z)$.

To avoid a costly inversion process of $\Phi$, t must be invertible (implemented as an
INN) so that a change of variables

$$p(\bar{z}|z) = \frac{p(v|z)}{\left|\det \nabla (t^{-1})(v|z)\right|}, \quad \text{where } v = t(\bar{z}|z), \tag{1}$$

yields $p(\bar{z}|z)$ by means of the distribution p(v|z) of invariances, given a model
representation z. Here, the denominator denotes the absolute value of the determinant
of the Jacobian $\nabla (t^{-1})$ of $v \mapsto t^{-1}(v|z) = \bar{z}$, which is efficient
to compute for common invertible network architectures. Consequently, we obtain $\bar{z}$
for given z by sampling from the invariant space v given z and then applying $t^{-1}$,

$$\bar{z} \sim p(\bar{z}|z) \iff v \sim p(v|z), \quad \bar{z} = t^{-1}(v|z). \tag{2}$$

Since v is the invariant space for z, both are complementary, which implies independence,
$p(v|z) = p(v)$. Because a powerful transformation $t^{-1}$ can transform between
two arbitrary densities, we can assume without loss of generality a Gaussian prior
$p(v) = \mathcal{N}(v|0, I)$, where I is the identity matrix. Given this prior, our task is
then to learn the transformation t whose inverse maps $\mathcal{N}(v|0, I)$ onto
$p(\bar{z}|z)$. To this end, we maximize the log-likelihood of $\bar{z}$ given z, which
results in a per-example loss of

$$J(\bar{z}, z) = -\log p(\bar{z}|z) = -\log \mathcal{N}(t(\bar{z}|z)\,|\,0, I) - \log\left|\det \nabla t(\bar{z}|z)\right|. \tag{3}$$


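Minimizing this loss over the training data distribution p(x) gives t, a bijective
mapping between $\bar{z}$ and (z, v),

\begin{align}
J(t) &= \mathbb{E}_{x \sim p(x)}\left[J(E(x), \Phi(x))\right] \tag{4} \\
     &= \mathbb{E}_{x \sim p(x)}\left[\frac{1}{2}\left\|t(E(x)|\Phi(x))\right\|^2 + \frac{N_{\bar{z}}}{2}\log 2\pi - \log\left|\det \nabla t(E(x)|\Phi(x))\right|\right]. \tag{5}
\end{align}

Note that both E and $\Phi$ remain fixed during minimization of J.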
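
As a rough illustration of the loss in (3) to (5) and of the sampling rule in (2), the following NumPy sketch replaces the INN t by a hypothetical conditional affine map whose shift and log-scale are linear functions of z. The names `toy_t`, `toy_t_inverse`, `W_shift`, and `W_scale` are assumptions made for this example only and are not part of the implementation described in this chapter; the sketch merely shows that the loss needs t and its log-determinant, while sampling needs a single pass through the inverse.

```python
import numpy as np

def toy_t(z_bar, z, W_shift, W_scale):
    """Hypothetical stand-in for v = t(z_bar | z): element-wise affine map conditioned on z."""
    shift = z @ W_shift                    # conditional shift, shape (batch, N_zbar)
    log_scale = np.tanh(z @ W_scale)       # bounded conditional log-scale
    v = (z_bar - shift) * np.exp(-log_scale)
    log_det = -log_scale.sum(axis=1)       # log |det grad t| of a diagonal Jacobian
    return v, log_det

def toy_t_inverse(v, z, W_shift, W_scale):
    """Inverse map z_bar = t^{-1}(v | z), used for sampling as in (2)."""
    shift = z @ W_shift
    log_scale = np.tanh(z @ W_scale)
    return v * np.exp(log_scale) + shift

def nll_loss(v, log_det):
    """Per-example loss (3): -log N(v | 0, I) - log |det grad t|."""
    n = v.shape[1]
    return 0.5 * np.sum(v ** 2, axis=1) + 0.5 * n * np.log(2.0 * np.pi) - log_det

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))                # placeholder model representations z = Phi(x)
z_bar = rng.normal(size=(4, 5))            # placeholder autoencoder encodings E(x)
W_shift = rng.normal(size=(3, 5))
W_scale = rng.normal(size=(3, 5))

v, log_det = toy_t(z_bar, z, W_shift, W_scale)
loss = nll_loss(v, log_det).mean()         # Monte Carlo estimate of (4)

# Sampling as in (2): draw v from the prior, then apply the inverse map.
v_sample = rng.standard_normal(size=(4, 5))
z_bar_sample = toy_t_inverse(v_sample, z, W_shift, W_scale)
```

In this toy setting the loss coincides with the negative log-density of a Gaussian with mean `shift` and standard deviation `exp(log_scale)`, which gives a simple way to sanity-check the change of variables.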


3.2 Interpreting Representations and Their Invariances 


Visualizing representations and invariances: For an image representation
$z = \Phi(x)$, (2) presents an efficient approach (a single forward pass through the INN t)
to sample an encoding $\bar{z}$, which is a combination of z with a particular realization
of its invariances v. Sampling multiple realizations of $\bar{z}$ for a given z highlights
what remains constant and what changes due to different v: information preserved
in the representation z remains constant over different samples, and information
discarded by the model ends up in the invariances v and shows changes over different
samples. Visualizing the samples $\bar{z} \sim p(\bar{z}|z)$ with $\bar{x} = D(\bar{z})$
portrays this constancy and the changes due to different v. To complement this
visualization, in the following, we learn a transformation of $\bar{z}$ into a semantically
meaningful representation, which allows us to uncover the semantics captured by z and v.


Learning an INN to produce semantic interpretations: The autoencoder representation
$\bar{z}$ is an equivalent representation of (z, v) but its feature dimensions do not
necessarily correspond to semantic concepts [FV18]. More generally, without supervision,
we cannot reliably discover semantically meaningful, explanatory factors of
$\bar{z}$ [LBL+19]. In order to explain $\bar{z}$ in terms of given semantic concepts, we
apply the approach of [ERO20] and learn a bijective transformation of $\bar{z}$ to an
interpretable representation $e(\bar{z})$ where different groups of components, called
factors, correspond to semantic concepts.

To learn the transformation e, we parameterize e by an INN and assume that
semantic concepts are defined implicitly by pairs of images, i.e., for each semantic
concept we have access to training pairs $x^a, x^b$ that have the respective concept in
common. For example, the semantic concept “smiling” is defined by pairs of images,
where either both images show smiling persons or both images show non-smiling
persons. Applying this formulation, input pairs which are similar in a certain semantic
concept are similar in the corresponding factor of the interpretable representation
$e(\bar{z})$.

Following [ERO20], the loss for training the invertible network e is then given by

$$J(e) = \mathbb{E}_{x^a, x^b}\left[-\log p\left(e(E(x^a)), e(E(x^b))\right) - \log\left|\det \nabla e(E(x^a))\right| - \log\left|\det \nabla e(E(x^b))\right|\right]. \tag{6}$$


Interpretation by applying the learned INNs: After training, the combination of e
with t from Sect. 3.1 provides semantic interpretations given a model representation
z: (2) gives realizations of the invariances v which are combined with z to produce
$\bar{z} = t^{-1}(v|z)$. Then e transforms $\bar{z}$ without loss of information into a
semantically accessible representation $(e_i)_i = e(\bar{z}) = e(t^{-1}(v|z))$ consisting
of different semantic factors $e_i$. Comparing the $e_i$ for different model
representations z and invariances v allows us to observe which semantic concepts the
model representation $z = \Phi(\cdot)$ is sensitive to, and which it is invariant to.


Semantic Modifications of Latent Representations: The transformations $t^{-1}$ and
e not only interpret a representation z in terms of accessible semantic concepts
$(e_i)_i$. Given $v \sim p(v)$, they also allow us to modify $\bar{z} = t^{-1}(v|z)$ in a
semantically meaningful manner by altering its corresponding $(e_i)_i$ and then applying
the inverse translation $e^{-1}$,

$$\bar{z} \xrightarrow{\;e\;} (e_i) \xrightarrow{\;\text{modification}\;} (e_i^*) \xrightarrow{\;e^{-1}\;} \bar{z}^*. \tag{7}$$

The modified representation $\bar{z}^*$ is then readily transformed back into image space,
$\bar{x}^* = D(\bar{z}^*)$. Besides visual interpretation of the modification, $\bar{x}^*$
can be fed into the model $\Psi(\Phi(\bar{x}^*))$ to probe for sensitivity to certain
semantic concepts.
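
The modification chain in (7) can be sketched with a toy invertible map. In the example below, an orthogonal matrix `Q` stands in for the semantic INN e, so that its inverse is simply the transpose, and `idx` plays the role of the index set of one factor. Both are hypothetical choices made only to keep the example short; a trained INN would take their place.

```python
import numpy as np

def swap_factor(z_bar, z_bar_other, Q, idx):
    """Replace one semantic factor of z_bar by that of z_bar_other, following (7).

    Q is a hypothetical orthogonal stand-in for the INN e, i.e. e(z_bar) = Q @ z_bar
    and e^{-1}(e) = Q.T @ e.
    """
    e = Q @ z_bar                  # z_bar -> (e_i)
    e_other = Q @ z_bar_other
    e_mod = e.copy()
    e_mod[idx] = e_other[idx]      # modify the selected factor only
    return Q.T @ e_mod             # (e_i^*) -> z_bar^*

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))   # random orthogonal matrix
z_bar_a = rng.normal(size=6)
z_bar_b = rng.normal(size=6)
idx = [0, 1]                                   # indices of the factor to transfer

z_bar_star = swap_factor(z_bar_a, z_bar_b, Q, idx)
```

Because the map is bijective, re-encoding `z_bar_star` returns exactly the modified factors: the transferred factor equals that of `z_bar_b`, and all remaining components are those of `z_bar_a`.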


3.3 Implementation Details 


In this section, we provide implementation details about the exact training procedure 
and architecture of all components of our approach. This is only for the sake of clarity 
and completeness. For readers who are already familiar with INNs or people rather 
interested in the higher level ideas of our approach than in its technical details, this 
section can be safely skipped. 


Autoencoder E, D: In Sect. 3.1, we introduced an autoencoder to obtain a representation
$\bar{z}$ of x, which includes the invariances abstracted away by a given model
representation z. This autoencoder consists of an encoder E(x) and a decoder $D(\bar{z})$.
Table 1 Autoencoder architecture on datasets with images of resolution 28 x 28

Encoder: FC and Norm refer to fully connected layer and batch normalization,
respectively. The term “down” denotes downsampling by factor 2. Leaky ReLU (LReLU)
uses a slope parameter of 0.2.

  RGB image $x \in I^{28 \times 28 \times 3}$, $I = [0, 1]$
  Conv down, Norm, LReLU → $\mathbb{R}^{14 \times 14 \times 64}$
  Conv down, Norm, LReLU → $\mathbb{R}^{7 \times 7 \times 128}$
  FC → $(\mu, \sigma\sigma^T) \in \mathbb{R}^{N_{\bar{z}}} \times \mathbb{R}^{N_{\bar{z}}}$

Decoder: FC and Norm refer to fully connected layer and batch normalization,
respectively. Leaky ReLU (LReLU) uses a slope parameter of 0.2.

  $\bar{z} \in \mathbb{R}^{N_{\bar{z}}} \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma\sigma^T))$
  FC → $\mathbb{R}^{7 \times 7 \times 128}$
  Conv transpose, Norm, LReLU → $\mathbb{R}^{14 \times 14 \times 64}$
  Conv transpose, Tanh → $I^{28 \times 28 \times 3}$, $I = [0, 1]$

Because the INNs t and e transform the distribution of $\bar{z}$, we must ensure
a strictly positive density for $\bar{z}$ to avoid degenerate solutions. This is readily
achieved with a stochastic encoder, i.e., we predict mean $E(x)_\mu$ and diagonal
covariance $E(x)_{\sigma^2}$ of a Gaussian distribution, and obtain the desired
representation as $\bar{z} \sim \mathcal{N}(\bar{z}\,|\,E(x)_\mu, \mathrm{diag}(E(x)_{\sigma^2}))$.
Following [DW19], we train this autoencoder as a variational autoencoder using the
reparameterization trick [KW14, RMW14] to match the encoded distribution to a standard
normal distribution, and jointly learn the scalar output variance $\gamma$ under an image
metric $\|x - \bar{x}\|$ to avoid blurry reconstructions. The resulting loss function is thus


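$$J(E, D, \gamma) = \mathbb{E}_{x \sim p(x),\ \epsilon \sim \mathcal{N}(\epsilon|0, I)}\left[\frac{1}{\gamma}\left\|x - D\!\left(E(x)_\mu + \sqrt{\mathrm{diag}(E(x)_{\sigma^2})}\,\epsilon\right)\right\| + \log \gamma + \frac{1}{2}\sum_{i=1}^{N_{\bar{z}}}\left\{(E(x)_\mu)_i^2 + (E(x)_{\sigma^2})_i - 1 - \log (E(x)_{\sigma^2})_i\right\}\right] \tag{8}$$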


Note that $\sqrt{\cdot}$ and $\log(\cdot)$ on multi-dimensional entities are applied
element-wise. In the experiments shown in this chapter, we use images of spatial
resolutions 28 x 28 and 128 x 128, resulting in different architectures for the
autoencoder, summarized in Tables 1 and 2, respectively. For the encoder E processing
images of spatial resolution 128 x 128, we use an architecture based on ResNet-101
[HZRS16], and for the corresponding decoder D we use an architecture based on BigGAN
[BDS19], where we include a small fully connected network to replace the class
conditioning used in BigGAN by a conditioning on $\bar{z}$.
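
To make the structure of the loss in (8) concrete, the sketch below evaluates it for a single example. The linear decoder `A`, the Euclidean norm used as image metric, and all variable names are assumptions for illustration; they do not reproduce the architectures in Tables 1 and 2.

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """KL divergence between N(mu, diag(var)) and N(0, I), the regularizer in (8)."""
    return 0.5 * np.sum(mu ** 2 + var - 1.0 - np.log(var))

def vae_loss(x, mu, var, decode, gamma, eps):
    """Single-sample estimate of (8); the Euclidean norm is an assumed image metric."""
    z_bar = mu + np.sqrt(var) * eps                    # reparameterization trick
    reconstruction = np.linalg.norm(x - decode(z_bar)) / gamma + np.log(gamma)
    return reconstruction + kl_to_standard_normal(mu, var)

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))                # hypothetical linear decoder weights

def decode(z_bar):
    return A @ z_bar                       # stands in for D(z_bar)

x = rng.normal(size=8)                     # placeholder input
mu = rng.normal(size=4)                    # stands in for E(x)_mu
var = np.exp(rng.normal(size=4))           # stands in for E(x)_{sigma^2}, strictly positive
eps = rng.standard_normal(4)

loss = vae_loss(x, mu, var, decode, gamma=0.5, eps=eps)
```

The regularizer vanishes exactly when the predicted mean is zero and the predicted variance is one, which is how the encoded distribution is pulled toward the standard normal prior.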

In the 28 x 28 case, we use a squared $L_2$ loss for the image metric, which corresponds
to the first term in (8). For our 128 x 128 models, we further use an improved
metric as in [DB16], which includes additional perceptual [ZIE+18] and discriminator
losses. The perceptual loss consists of $L_1$ feature distances obtained from different
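Table 2 Autoencoder architecture for datasets with images of resolution 128 x 128

Encoder based on ResNet-101: Bottleneck, FC, and Norm refer to bottleneck residual
unit, fully connected layer, and batch normalization, respectively. The term “down”
denotes downsampling by factor 2.

  RGB image $x \in I^{128 \times 128 \times 3}$, $I = [0, 1]$
  Conv down → $\mathbb{R}^{64 \times 64 \times 64}$
  Norm, ReLU, MaxPool → $\mathbb{R}^{32 \times 32 \times 64}$
  3x BottleNeck → $\mathbb{R}^{32 \times 32 \times 256}$
  4x BottleNeck down → $\mathbb{R}^{16 \times 16 \times 512}$
  23x BottleNeck down → $\mathbb{R}^{8 \times 8 \times 1024}$
  3x BottleNeck down → $\mathbb{R}^{4 \times 4 \times 2048}$
  AvgPool, FC → $(\mu, \sigma\sigma^T) \in \mathbb{R}^{N_{\bar{z}}} \times \mathbb{R}^{N_{\bar{z}}}$

Decoder based on BigGAN: Embed, FC, Norm, and ResBlock refer to embedding layers,
fully connected layers, batch normalization, and residual block, respectively. The
term “up” denotes upsampling by factor 2. Leaky ReLU (LReLU) uses a slope parameter
of 0.2.

  $\bar{z} \in \mathbb{R}^{N_{\bar{z}}} \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma\sigma^T))$
  3x (FC, LReLU)
  FC, Softmax → $\mathbb{R}^{1000}$
  Embed ↦ h
  FC($\bar{z}$) → $\mathbb{R}^{4 \times 4 \times 16 \cdot 96}$
  ResBlock($\bar{z}$, h) up → $\mathbb{R}^{8 \times 8 \times 16 \cdot 96}$
  ResBlock($\bar{z}$, h) up → $\mathbb{R}^{16 \times 16 \times 8 \cdot 96}$
  ResBlock($\bar{z}$, h) up → $\mathbb{R}^{32 \times 32 \times 4 \cdot 96}$
  ResBlock($\bar{z}$, h) up → $\mathbb{R}^{64 \times 64 \times 2 \cdot 96}$
  Non-Local Block → $\mathbb{R}^{64 \times 64 \times 2 \cdot 96}$
  ResBlock($\bar{z}$, h) up → $\mathbb{R}^{64 \times 64 \times 96}$
  Norm, ReLU, Conv up → $\mathbb{R}^{128 \times 128 \times 3}$
  Tanh → $\bar{x} \in I^{128 \times 128 \times 3}$, $I = [0, 1]$
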


layers of a fixed, pretrained network. We use a VGG-16 network pretrained on Ima- 
geNet and weighted distances of different layers as in [ZIE+18]. The discriminator 
is trained along with the autoencoder to distinguish reconstructed images from real 
images using a binary classification loss, and the autoencoder maximizes the log- 
probability that reconstructed images are classified as real images. The architectures 
of VGG-16 and the discriminator are summarized in Table 3. 


Details on the INN e for Revealing Semantics of Deep Representations: Previous
works have successfully applied INNs for density estimation [DSB17], inverse
problems [AKW+19], and on top of autoencoder representations [ERO20, XYA19,
DMB+21, BMDO21b, BMDO21a] for a wide range of applications such as video
synthesis [DMB+21, BMDO21b] and translation between pretrained, powerful
networks [REO20]. This section provides details on how we embed the approach
of [ERO20] to reveal the semantic concepts of autoencoder representations $\bar{z}$, cf.
Sect. 3.2.

Since we will never have examples for all relevant semantic concepts, we include
a residual concept that captures the remaining variability of $\bar{z}$, which is not
explained by the given semantic concepts.

Following [ERO20], we learn a bijective transformation $e(\bar{z})$, which translates
the non-interpretable representation $\bar{z}$ invertibly into a factorized representation




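Table 3 Architectures used to compute image metrics for the autoencoder, which were used for
training the autoencoder E, D on datasets with images of resolution 128 x 128

VGG-16 pretrained on ImageNet for feature extraction. Output of bold layers are used to
compute feature distances.

  RGB image $x \in \mathbb{R}^{128 \times 128 \times 3}$
  2x Conv, ReLU → $\mathbb{R}^{128 \times 128 \times 64}$
  MaxPool → $\mathbb{R}^{64 \times 64 \times 64}$
  2x Conv, ReLU → $\mathbb{R}^{64 \times 64 \times 128}$
  MaxPool → $\mathbb{R}^{32 \times 32 \times 128}$
  3x Conv, ReLU → $\mathbb{R}^{32 \times 32 \times 256}$
  MaxPool → $\mathbb{R}^{16 \times 16 \times 256}$
  3x Conv, ReLU → $\mathbb{R}^{16 \times 16 \times 512}$
  MaxPool → $\mathbb{R}^{8 \times 8 \times 512}$
  3x Conv, ReLU → $\mathbb{R}^{8 \times 8 \times 512}$

Discriminator: All convolutions use kernel size 4. Norm refers to batch normalization,
leaky ReLU uses a slope parameter 0.2.

  RGB image $x \in \mathbb{R}^{128 \times 128 \times 3}$
  Conv down, LReLU → $\mathbb{R}^{64 \times 64 \times 64}$
  Conv down, Norm, LReLU → $\mathbb{R}^{32 \times 32 \times 128}$
  Conv down, Norm, LReLU → $\mathbb{R}^{16 \times 16 \times 256}$
  Conv down, Norm, LReLU → $\mathbb{R}^{8 \times 8 \times 512}$
  Conv, Norm, LReLU → $\mathbb{R}^{8 \times 8 \times 512}$
  Conv → $\mathbb{R}^{8 \times 8 \times 1}$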


$(e_i(\bar{z}))_{i=0}^{K} = e(\bar{z})$, where each factor $e_i \in \mathbb{R}^{N_{e_i}}$
represents one of the given semantic concepts for $i = 1, \dots, K$, and
$e_0 \in \mathbb{R}^{N_{e_0}}$ is the residual concept.

The INN e establishes a one-to-one correspondence between an encoding and
different semantic concepts and, conversely, enables semantic modifications to
correctly alter the original encoding (see next section). Being an INN, $e(\bar{z})$ and
$\bar{z}$ need to have the same dimensionality and we set
$N_{e_0} = N_{\bar{z}} - \sum_{i=1}^{K} N_{e_i}$. We denote the indices of concept i with
respect to $e(\bar{z})$ as $\mathcal{I}_i \subset \{1, \dots, N_{\bar{z}}\}$ such that we can
write $e_i = (e(\bar{z})_k)_{k \in \mathcal{I}_i}$.

In the following, we focus on deriving a loss function for training the semantic
INN. Let $e_i$ be the factor representing some semantic concept, e.g., gender, that the
contents of two images $x^a, x^b$ share. Then the projection of their encodings
$\bar{z}^a, \bar{z}^b$ onto this semantic concept must be similar [ERO20, KWKT15],

$$e_i(\bar{z}^a) \simeq e_i(\bar{z}^b) \quad \text{where } \bar{z}^a = E(x^a),\ \bar{z}^b = E(x^b). \tag{9}$$

Moreover, to interpret $\bar{z}$ we are interested in the separate contribution of
different semantic concepts $e_i$ that explain $\bar{z}$. Hence, we seek a mapping
$e(\cdot)$ that strives to disentangle different concepts,

$$e_i(\bar{z}) \perp e_j(\bar{z}) \quad \forall i \neq j,\ x \quad \text{where } \bar{z} = E(x). \tag{10}$$

The objectives in (9), (10) imply a correlation in $e_i$ for pairs $\bar{z}^a$ and
$\bar{z}^b$ and no correlation between concepts $e_i, e_j$ for $i \neq j$. This calls for a
Gaussian distribution with a covariance matrix that reflects these requirements.
Let $e^a = (e_i^a) = (e_i(E(x^a)))$ and $e^b$ likewise, where $x^a, x^b$ are samples from a
training distribution $p(x^a, x^b)$ for the ith semantic concept. The distribution of
pairs $e^a$ and $e^b$ factorizes into a conditional and a marginal,

$$p(e^a, e^b) = p(e^b|e^a)\, p(e^a). \tag{11}$$

Objective (10) implies a diagonal covariance for the marginal distribution $p(e^a)$,
i.e., a standard normal distribution, and (9) entails a correlation between $e_i^a$ and
$e_i^b$. Therefore, the correlation matrix is
$\Sigma^{ab} = \rho\, \mathrm{diag}\big((\delta_{\mathcal{I}_i}(k))_{k=1}^{N_{\bar{z}}}\big)$, where

$$\delta_{\mathcal{I}_i}(k) = \begin{cases} 1 & \text{if } k \in \mathcal{I}_i, \\ 0 & \text{else.} \end{cases}$$

By symmetry, $p(e^b) = p(e^a)$, which gives

$$p(e^b|e^a) = \mathcal{N}\big(e^b \,\big|\, \Sigma^{ab} e^a,\ I - (\Sigma^{ab})^2\big). \tag{12}$$


Inserting (12) and a standard normal distribution for $p(e^a)$ into (11) yields the
negative log-likelihood for a pair $e^a, e^b$.
Given pairs $x^a, x^b$ as training data, another change of variables from
$\bar{z}^a = E(x^a)$ to $e^a = e(\bar{z}^a)$ gives the training loss function for e as the
negative log-likelihood of $\bar{z}^a, \bar{z}^b$,

$$J(e) = \mathbb{E}_{x^a, x^b}\left[-\log p\left(e(E(x^a)), e(E(x^b))\right) - \log\left|\det \nabla e(E(x^a))\right| - \log\left|\det \nabla e(E(x^b))\right|\right]. \tag{13}$$


For simplicity, we have derived the loss for a single semantic concept $e_i$. Simply
summing over the losses of different semantic concepts yields their joint loss function
and allows us to learn a joint translator e for all of them.

In the following, we focus on the log-likelihood of pairs. The loss for e in (13)
contains the log-likelihood of pairs $e^a, e^b$. Inserting (12) and a standard normal
distribution for $p(e^a)$ into (11) yields

$$-\log p(e^a, e^b) = \frac{1}{2}\left[\sum_{k \in \mathcal{I}_i} \frac{(e_k^b - \rho e_k^a)^2}{1 - \rho^2} + \sum_{k \in \mathcal{I}_i^c} (e_k^b)^2 + \sum_{k=1}^{N_{\bar{z}}} (e_k^a)^2\right] + C, \tag{14}$$

where $C = C(\rho, N_{\bar{z}})$ is a constant that can be ignored for the optimization
process. The parameter $\rho \in (0, 1)$ determines the relative importance of loss terms
corresponding to the similarity requirement in (9) and the independence requirement
in (10). We use a fixed value of $\rho = 0.9$ for all experiments.
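
The pair likelihood in (14) can be checked numerically. The sketch below evaluates (14) for one pair, including the constant C written out for this toy case, and compares it with the negative log-density of the joint Gaussian implied by (11) and (12). Function names and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def pair_nll(e_a, e_b, idx, rho):
    """Negative log-likelihood (14) of a pair, including the constant C."""
    n = e_a.size
    shared = np.zeros(n, dtype=bool)
    shared[idx] = True                                   # index set I_i of the shared factor
    quad_shared = np.sum((e_b[shared] - rho * e_a[shared]) ** 2) / (1.0 - rho ** 2)
    quad_rest = np.sum(e_b[~shared] ** 2)                # components outside I_i
    quad_a = np.sum(e_a ** 2)                            # standard normal marginal p(e^a)
    const = n * np.log(2.0 * np.pi) + 0.5 * shared.sum() * np.log(1.0 - rho ** 2)
    return 0.5 * (quad_shared + quad_rest + quad_a) + const

def joint_gaussian_nll(e_a, e_b, idx, rho):
    """Same quantity from the joint Gaussian with cross-covariance rho * diag(delta)."""
    n = e_a.size
    cross = np.zeros((n, n))
    cross[idx, idx] = rho
    cov = np.block([[np.eye(n), cross], [cross, np.eye(n)]])
    e = np.concatenate([e_a, e_b])
    _, log_det = np.linalg.slogdet(cov)
    return 0.5 * e @ np.linalg.solve(cov, e) + 0.5 * log_det + n * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
e_a = rng.normal(size=6)
e_b = rng.normal(size=6)
idx = [0, 1, 2]                                          # hypothetical factor indices
nll = pair_nll(e_a, e_b, idx, rho=0.9)
nll_reference = joint_gaussian_nll(e_a, e_b, idx, rho=0.9)
```

Both functions return the same value, which confirms that (14) is the negative log-density of a Gaussian in which only the components of the shared factor are correlated across the pair.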

In the following, we describe the architecture of the semantic INN. In our imple- 
mentation, e is built by stacking invertible blocks, see Fig. 2, which consist of three 
Fig. 2 A single invertible block used to build our invertible neural networks

Fig. 3 Architectures of our INN models. top: The semantic INN e consists of stacked invertible
blocks. bottom: The conditional INN t is composed of an embedding module H that downsamples
(upsamples if necessary) a given model representation $h = H(z) = H(\Phi(x))$. Subsequently, h is
concatenated to the inputs of each block of the invertible model



invertible layers: coupling blocks [DSB17], actnorm layers [KD18], and shuffling
layers. The final output is split into the factors $(e_i)$, see Fig. 3.

Coupling blocks split their input $x = (x_1, x_2)$ along the channel dimension and
use fully connected neural networks $s_i$ and $\tau_i$ to perform the following computation:

\begin{align}
\tilde{x}_1 &= x_1 \odot s_1(x_2) + \tau_1(x_2), \tag{15} \\
\tilde{x}_2 &= x_2 \odot s_2(\tilde{x}_1) + \tau_2(\tilde{x}_1), \tag{16}
\end{align}

with the element-wise multiplication operator $\odot$. Actnorm layers consist of learnable
shift and scale parameters for each channel, which are initialized to ensure activations
with zero mean and unit variance on the first training batch. Shuffling layers use a
fixed, randomly initialized permutation to shuffle the channels of their input, which
provides a better mixing of channels for subsequent coupling layers.
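
A minimal NumPy sketch of one such invertible block is given below. It is not the implementation used in our experiments: the tiny random networks, the parameterization of the scale as the exponential of a bounded output (which keeps it positive), and the order actnorm, coupling, shuffle are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_net(d_in, d_out, hidden=16):
    """Tiny fully connected network returning a log-scale and a shift (hypothetical sizes)."""
    W1 = rng.normal(scale=0.5, size=(d_in, hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, 2 * d_out))
    def net(u):
        log_s, tau = np.split(np.tanh(u @ W1) @ W2, 2, axis=1)
        return np.tanh(log_s), tau          # bounded log-scale, so s = exp(log_s) > 0
    return net

class InvertibleBlock:
    """Actnorm -> coupling (15), (16) -> fixed channel shuffle; the order is an assumption."""

    def __init__(self, dim, first_batch):
        self.d1 = dim // 2
        # Actnorm: initialize so that the first batch has zero mean and unit variance.
        self.shift = first_batch.mean(axis=0)
        self.log_scale = -np.log(first_batch.std(axis=0))
        self.net1 = make_net(dim - self.d1, self.d1)
        self.net2 = make_net(self.d1, dim - self.d1)
        self.perm = rng.permutation(dim)
        self.inv_perm = np.argsort(self.perm)

    def forward(self, x):
        h = (x - self.shift) * np.exp(self.log_scale)
        h1, h2 = h[:, :self.d1], h[:, self.d1:]
        log_s1, tau1 = self.net1(h2)
        y1 = h1 * np.exp(log_s1) + tau1                 # (15)
        log_s2, tau2 = self.net2(y1)
        y2 = h2 * np.exp(log_s2) + tau2                 # (16)
        log_det = self.log_scale.sum() + log_s1.sum(axis=1) + log_s2.sum(axis=1)
        return np.concatenate([y1, y2], axis=1)[:, self.perm], log_det

    def inverse(self, y):
        y = y[:, self.inv_perm]                         # undo the shuffle
        y1, y2 = y[:, :self.d1], y[:, self.d1:]
        log_s2, tau2 = self.net2(y1)
        h2 = (y2 - tau2) * np.exp(-log_s2)              # invert (16)
        log_s1, tau1 = self.net1(h2)
        h1 = (y1 - tau1) * np.exp(-log_s1)              # invert (15)
        h = np.concatenate([h1, h2], axis=1)
        return h * np.exp(-self.log_scale) + self.shift # undo actnorm

x = 3.0 * rng.normal(size=(8, 6)) + 1.0
block = InvertibleBlock(6, first_batch=x)
y, log_det = block.forward(x)
x_rec = block.inverse(y)
```

The log-determinant is a plain sum of log-scales because the Jacobian of each coupling step is triangular and the shuffle only permutes channels, which is what makes the losses in (3) and (13) cheap to evaluate.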


Fig. 4 Graphical distinction of information flow during training and inference for a model
$f(x) = \Psi(\Phi(x))$. During training of t, the encoder E provides an (approximately
complete) data representation, which is used to learn the invariances of a given model's
representations z. At inference, the encoder is not necessarily needed anymore: given a
representation $z = \Phi(x)$, invariances can be sampled from the prior distribution and
decoded into data space through $t^{-1}$


Conditional INN t for recovering invariances of deep representations: We first
elaborate on the architecture of the conditional INN. We build the conditional
invertible neural network t by expanding the semantic model e as follows: Given a model
representation z, which is used as the conditioning of the INN, we first calculate its
embedding

$$h = H(z), \tag{17}$$

which is subsequently fed into the affine coupling block:

\begin{align}
\tilde{x}_1 &= x_1 \odot s_1(x_2, h) + \tau_1(x_2, h), \tag{18} \\
\tilde{x}_2 &= x_2 \odot s_2(\tilde{x}_1, h) + \tau_2(\tilde{x}_1, h), \tag{19}
\end{align}

with $\odot$ again being an element-wise multiplication operator, where $s_i$ and $\tau_i$
are modified from (15) and (16) such that they are capable of processing a concatenated
input $(x_i, h)$. The embedding module H is usually a shallow convolutional neural network
used to down-/upsample a given model representation z to a size that the networks
$s_i$ and $\tau_i$ are able to process. This means that t, analogous to e, consists of
stacked invertible blocks, where each block is composed of coupling blocks, actnorm
layers, and shuffling layers, cf. Sect. 3.3, details on the INN e for revealing semantics
of deep representations, and Fig. 2. The complete architectures of both t and e are
depicted in Fig. 3. Additionally, Fig. 4 provides a graphical distinction of the training
and testing process of t. During training, the autoencoder $D \circ E$ provides a
representation of the data that contains both the invariances and the representation of
some model w.r.t. the input x. After training of t, the encoder may be discarded and
visual decodings and/or semantic interpretations of a model representation z can be
obtained by sampling and transforming v as described in (2).
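
The conditional coupling step in (17) to (19) differs from the unconditional one only in that the embedding h is concatenated to the inputs of the scale and shift networks. The sketch below illustrates this with a hypothetical linear embedding `W_embed` in place of the shallow convolutional module H and small random networks; none of these choices reflect the actual architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_conditional_net(d_in, d_cond, d_out, hidden=16):
    """Hypothetical network for (s_i, tau_i) that takes the concatenated input (x_i, h)."""
    W1 = rng.normal(scale=0.5, size=(d_in + d_cond, hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, 2 * d_out))
    def net(x_part, h):
        out = np.tanh(np.concatenate([x_part, h], axis=1) @ W1) @ W2
        log_s, tau = np.split(out, 2, axis=1)
        return np.tanh(log_s), tau               # s = exp(log_s) stays positive
    return net

def coupling_forward(x, h, net1, net2, d1):
    x1, x2 = x[:, :d1], x[:, d1:]
    log_s1, tau1 = net1(x2, h)
    y1 = x1 * np.exp(log_s1) + tau1              # (18)
    log_s2, tau2 = net2(y1, h)
    y2 = x2 * np.exp(log_s2) + tau2              # (19)
    return np.concatenate([y1, y2], axis=1), log_s1.sum(axis=1) + log_s2.sum(axis=1)

def coupling_inverse(y, h, net1, net2, d1):
    y1, y2 = y[:, :d1], y[:, d1:]
    log_s2, tau2 = net2(y1, h)
    x2 = (y2 - tau2) * np.exp(-log_s2)
    log_s1, tau1 = net1(x2, h)
    x1 = (y1 - tau1) * np.exp(-log_s1)
    return np.concatenate([x1, x2], axis=1)

W_embed = rng.normal(size=(10, 4))               # hypothetical linear stand-in for H
z = rng.normal(size=(3, 10))                     # placeholder model representations
h = z @ W_embed                                  # h = H(z), as in (17)
x = rng.normal(size=(3, 6))
net1 = make_conditional_net(3, 4, 3)
net2 = make_conditional_net(3, 4, 3)
y, log_det = coupling_forward(x, h, net1, net2, d1=3)
x_rec = coupling_inverse(y, h, net1, net2, d1=3)
```

The block is invertible in x for any fixed h, while a different conditioning h yields a different bijection. This is the property that lets t separate the invariances v from the representation z.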


4 Experiments 


To explore the applicability of our approach, we conduct experiments on several mod- 
els: SqueezeNet [IHM+16], which provides lightweight classification, FaceNet 
[SKP15], a baseline for face recognition and clustering, trained on the VGGFace2 
dataset [CSX+18], and variants of ResNet [HZRS16], a popular architecture often 
used when fine-tuning a classifier on a specific task and dataset. 

Experiments are conducted on the following datasets: CelebA [LLWT15], Ani- 
malFaces [LHM+19], Animals (containing carnivorous animals), ImageNet 
[DDS+09], and ColorMNIST, which is an augmented version of the MNIST dataset 
[LCB98], where both background and foreground have random, independent colors. 
Evaluation details follow in Sect. 4.5. 


4.1 Comparison to Existing Methods 


A key insight of our chapter is that reconstructions from a given model's
representation $z = \Phi(x)$ are impossible if the invariances the model has learned are
not considered. In Fig. 5, we compare our approach to existing methods that either try
to reconstruct the image via gradient-based optimization [MV16] or by training
a reconstruction network directly on the representations z [DB16]. By conditionally
sampling images $\bar{x} = D(\bar{z})$, where we obtain $\bar{z}$ via the INN t as
described in (2) based on the invariances $v \sim p(v) = \mathcal{N}(0, I)$, we bypass this
shortcoming and obtain natural images without artifacts for any layer depth. The
increased image quality is further confirmed by the Fréchet inception distance (FID)
scores [HRU+17] reported in Table 4.


4.2 Understanding Models 


Reconstructions $\bar{x}$ from representations $z = \Phi(x)$ of different layers
(columns: input, conv5, fc6, fc7, fc8; rows: ours, [DB16], [MV16])

Fig. 5 Comparison to existing network inversion methods for AlexNet [KSH12]. In contrast
to the methods of [DB16] (D&B) and [MV16] (M&V), our invertible method explicitly samples
the invariances of $\Phi$ w.r.t. the data, which circumvents a common cause for artifacts and
produces natural images independent of the depth of the layer which is reconstructed


Table 4 FID scores for layer visualizations of AlexNet, obtained with our method and [DB16]
(D&B). Scores are calculated on the Animals dataset

Layer   conv5        fc6          fc7          fc8          Output
ours    23.6 ± 0.5   24.3 ± 0.7   24.9 ± 0.4   26.4 ± 0.4   27.4 ± 0.3
D&B     25.2         24.9         27.2         36.1         352.6


Interpreting a face recognition model: FaceNet [SKP15] is a widely accepted
baseline in the field of face recognition. This model embeds input images of human
faces into a latent space where similar images have a small $L_2$-distance. We aim to
understand the process of face recognition within this model by analyzing and
visualizing learned invariances for several layers explicitly; see Table 6 for a detailed
breakdown of the various layers of FaceNet. For the experiment, we use a pretrained
FaceNet and train the generative model presented in (2) by conditioning on
various layers. Figure 6 depicts the amount of variance present in each selected layer
when generating n = 250 samples for each of the 100 different input images. This
variance serves as a proxy for the amount of abstraction capability FaceNet has
learned in its respective layers: more abstract representations allow for a rich variety
of corresponding synthesized images, resulting in a large variance in image space
when being decoded. We observe an approximately exponential growth of learned
invariances with increasing layer depth, suggesting that abstraction mainly happens in
the deepest layers of the network. Furthermore, we are able to synthesize images that
correspond to the given model representation for each selected layer.
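
The variance used as a proxy above can be computed directly from a stack of decoded samples. The sketch below shows one plausible convention (variance over samples, averaged over channels and then over pixels); this convention and the synthetic placeholder samples are assumptions, since no decoder or trained INN is involved here.

```python
import numpy as np

def sample_statistics(samples):
    """Mean image, per-pixel variance map, and spatially averaged variance.

    samples: array of shape (n, H, W, C) holding n decoded samples for one z.
    """
    mean_image = samples.mean(axis=0)
    variance_map = samples.var(axis=0).mean(axis=-1)   # average over channels -> (H, W)
    return mean_image, variance_map, float(variance_map.mean())

rng = np.random.default_rng(0)
# Synthetic stand-ins: nearly identical samples versus strongly varying samples.
shallow = 0.5 + 0.01 * rng.standard_normal(size=(250, 8, 8, 3))
deep = 0.5 + 0.20 * rng.standard_normal(size=(250, 8, 8, 3))
_, _, var_shallow = sample_statistics(shallow)
_, _, var_deep = sample_statistics(deep)
```

A representation that discards more information admits more diverse samples and therefore yields the larger averaged variance, as `var_deep` does in this toy comparison.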


How does relevance of different concepts emerge during training? Humans tend 
to provide explanations of entities by describing them in terms of their semantics, 
e.g., size or color. In a similar fashion, we want to semantically understand how a 
network (here: SqueezeNet) learns to solve a given problem. 
Fig. 6 Left: Visualizing FaceNet representations and their invariances. Sampling multiple
reconstructions $\bar{x} = D(t^{-1}(v|z_\ell))$ shows the degree of invariance learned by
different layers $\ell$. The invariance w.r.t. pose increases for deeper layers as expected
for face identification. Surprisingly, FaceNet uses glasses as an identity feature
throughout all its layers as evident from the spatial mean and variance plots, where the
glasses are still visible. This reveals a bias and weakness of the model. Right: spatially
averaged variances over multiple x and layers


Intuitively, a network should, for example, be able to solve a given classification
problem by focusing on the relevant information while discarding task-irrelevant
information. To build on this intuition, we construct a toy problem: digit
classification on ColorMNIST. We expect the model to ignore both the random background
and foreground colors of the input data, as they do not help in making a classification
decision. Thus, we apply the invertible approach presented in Sect. 3.2 and recover
three distinct factors: digit class, background color, and foreground color. To capture
the semantic changes occurring over the course of training of this classifier, we couple
20 instances of the invertible interpretation model to the last convolutional layer, each
representing a checkpoint between iteration 0 and iteration 40000 (equally distributed).
The result is shown in Fig. 7: we see that the digit factor becomes increasingly
relevant, with its relevance being strongly correlated to the accuracy of the model.


4.3 Effects of Data Shifts on Models 


This section investigates the effects that altering input data has on the model we 
want to understand. We examine these effects by manipulating input data through 
adversarial attacks or image stylization. 


How do adversarial attacks affect network representations? Here, we experiment
with the fast gradient sign method (FGSM) [GSS15], which manipulates the input
image by a signed-gradient step that increases the loss of a given classification
model. To understand
Fig. 7 Analyzing the degree to which different semantic concepts are captured by a network
representation changes as training progresses. For SqueezeNet on ColorMNIST, we measure
how much the data varies in different semantic concepts $e_i$ and how much of this variability
is captured by z at different training iterations. Early on, z is sensitive to foreground and
background color, and later on it learns to focus on the digit attribute. The ability to
encode this semantic concept is proportional to the classification accuracy achieved by z. At
training iterations 4k and 36k, we apply our method to visualize model representations and
thereby illustrate how their content changes during training


how such an attack modifies representations of a given model, we first compute the
image's invariances with respect to the model as $v = t(E(x)|\Phi(x))$. For an attacked
image $x^*$, we then compute the attacked representation as $z^* = \Phi(x^*)$. Decoding
this representation with the original invariance v allows us to precisely visualize
what the adversarial attack changed. This decoding, $\bar{x}^* = D(t^{-1}(v|z^*))$, is
shown in Fig. 8. We observe that, over layers of the network, the adversarial attack
gradually changes the representation toward its target. Its ability to do so is strongly
correlated with the amount of invariances, quantified as the total variance explained by
v, for a given layer as also observed in [JBZB19].
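
For readers unfamiliar with FGSM, the following sketch applies a single signed-gradient step to a hypothetical linear softmax classifier, for which the input gradient of the cross-entropy loss is available in closed form. The classifier weights, the toy input, and the step size are assumptions and are unrelated to the ResNet-101 model used in Fig. 8; clipping to the valid pixel range is omitted for brevity.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()          # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(x, label, W, b):
    return -np.log(softmax(W @ x + b)[label])

def fgsm(x, label, W, b, epsilon):
    """One FGSM step x* = x + epsilon * sign(grad_x loss) for a linear softmax classifier."""
    probs = softmax(W @ x + b)
    probs[label] -= 1.0                      # d loss / d logits = softmax - one_hot(label)
    grad_x = W.T @ probs                     # chain rule through the linear layer
    return x + epsilon * np.sign(grad_x)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))                 # hypothetical classifier weights
b = rng.normal(size=3)
x = rng.uniform(size=16)                     # flattened toy "image" in [0, 1]
label = 1
x_attacked = fgsm(x, label, W, b, epsilon=0.05)
```

Each pixel moves by at most `epsilon`, yet the loss of the classifier increases. In the analysis above, the attacked image would then be passed through $\Phi$ to obtain $z^*$.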


How does training on different data affect the model? Geirhos et al. [GRM+19]
proposed the hypothesis that classification networks based on convolutional blocks
mainly focus on texture patterns to obtain class probabilities. We further validate
this hypothesis by training our invertible network t conditioned on pre-logits
$z = \Phi(x)$ (i.e., the penultimate layer) of two ResNet-50 realizations. As shown in
Fig. 9, a ResNet architecture trained on standard ImageNet is susceptible to the
Visualizing the perturbed representation at input, conv, fc, and logits for the
perturbations “None” (prediction: Siamese cat) and “Attack” (prediction: mountain lion).
Variance of $\bar{z}$ explained by v: 11.82% (±0.52), 7.22% (±0.16), 49.59% (±2.00),
84.77% (±5.77)

Fig. 8 Visualizing FGSM adversarial attacks on ResNet-101. To the human eye, the original
image and its attacked version are almost indistinguishable. However, the input image is
correctly classified as “Siamese cat”, while the attacked version is classified as “mountain
lion”. Our approach visualizes how the attack spreads throughout the network. Reconstructions
of representations of attacked images demonstrate that the attack targets the semantic content
of deep layers. The variance of $\bar{z}$ explained by v combined with these visualizations
shows how increasing invariances cause vulnerability to adversarial attacks


so-called “texture-bias”, as samples generated conditioned on representations of pure
texture images consistently show valid images of corresponding input classes. We
furthermore visualize that this behavior can indeed be removed by training the same
architecture on a stylized version of ImageNet¹; the classifier then focuses on shape.
Rows 10-12 of Fig. 9 show that the proposed approach can be used to generate
sketch-based content with the texture-agnostic network.


4.4 Modifying Representations 


Invertible access to semantic concepts enables targeted modifications of
representations $\bar{z}$. In combination with a decoder for $\bar{z}$, we obtain semantic
image editing capabilities. We provide an example in Fig. 10, where we modify the factors
hair color, glasses, gender, beard, age, and smile. We infer $\bar{z} = E(x)$ from
an input image. Our semantic INN e then translates this representation into semantic
factors $(e_i)_i = e(\bar{z})$, where individual semantic concepts can be modified
independently via the corresponding factor $e_i$. In particular, we can replace each factor
with that from another image, effectively transferring semantics from one representation
onto another. Due to the invertibility of e, the modified representation can be
translated back into the space of the autoencoder and is readily decoded to a modified
image $\bar{x}^*$.


¹ We used weights available at https://github.com/rgeirhos/texture-vs-shape.

Samples $\bar{x} = D(t^{-1}(v|z))$ conditioned on ResNet pre-logits $z = \Phi(x)$
(left: $\Phi_{\text{vanilla}}$, ResNet-50 trained on standard ImageNet; right:
$\Phi_{\text{stylized}}$, ResNet-50 trained on stylized ImageNet; first column: inputs)

Fig. 9 Revealing texture bias in ImageNet classifiers. We compare visualizations of z from the
penultimate layer of ResNet-50 trained on standard ImageNet (left) and a stylized version of
ImageNet (right). On natural images (rows 1-3), both models recognize the input. Removing
textures through stylization (rows 4-6) makes images unrecognizable to the standard model;
however, it recognizes objects from textured patches (rows 7-9). Rows 10-12 show that a model
without texture bias can be used for sketch-to-image synthesis


To observe which semantic concepts FaceNet is sensitive to, we compute the 
average distance ‖f(x) − f(x*)‖ between its embeddings of x and semantically
modified x* over the test set (last row in Fig. 10). Evidently, FaceNet is particu- 
larly sensitive to differences in gender and glasses. The latter suggests a failure of 
FaceNet to identify persons correctly after they put on glasses. 
[Figure 10 image grid: the input image, followed by one column per modified factor $e_1, \ldots, e_6$.]

Modified factor            hair      glasses   gender    beard     age       smiling
Mean embedding distance    0.872     1.000     1.061     0.803     0.874     0.833
(± std)                    (±0.048)  (±0.046)  (±0.030)  (±0.041)  (±0.053)  (±0.034)


Fig. 10 Semantic modifications on CelebA. For each column, after inferring the semantic factors 
$(e_i)_i = e(E(x))$ of the input $x$, we replace one factor $e_i$ by that from another randomly chosen
image that differs in this concept. The inverse of $e$ translates this semantic change back into a
modified $\bar{z}$, which is decoded to a semantically modified image. Distances between FaceNet
embeddings before and after modification demonstrate its sensitivity to differences in gender and 
glasses (see also Fig. 6) 


4.5 Evaluation Details 


Here, we provide additional details on the neural networks investigated in Sect. 4
and present a way to quantify the amount of invariances in those networks. This
section is included for completeness only. As with Sect. 3.3, readers who are
interested in the higher-level concepts of our approach rather than its technical
details can skip it. An overview of INN hyperparameters for all experiments
is provided in Table 5.


Table 5 Hyperparameters of INNs for each experiment. Parameter $n_{\mathrm{flow}}$ denotes the number of
invertible blocks within the model, see Fig. 2. Parameters $h_w$ and $h_d$ refer to the width and depth
of the fully connected subnetworks $s_i$ and $t_i$, respectively


Experiment                                        INN   Input dim.   n_flow   h_w    h_d
Comparison, Sect. 4.1                             t     128          20       1024
Understanding models: FaceNet, Sect. 4.2          t     128          20       512    2
Understanding models: FaceNet, Sect. 4.2          e     128          12       512    2
Data effects: Adversarial attack, Sect. 4.3       t     128          20       1024   2
Data effects: Texture bias, Sect. 4.3             t     268          20       1024   2
Modifications: FaceNet & CelebA, Sect. 4.4        e     128          12       512    2
Table 6 High-level architectures of FaceNet and ResNet, depicted as pytorch-modules.
Layers investigated in our experiments are marked in bold. Spatial sizes are provided as a visual aid
and vary from model to model in our experiments. If not stated otherwise, we always extract from
the last layer in a series of blocks (e.g., on the right: 23x BottleNeck down → ℝ^{8×8×1024} refers
to the last module in the series of 23 blocks)

FaceNet: Implementations of layers Mixed, Block35, Block17, Block8 can be found
at https://github.com/timesler/facenet-pytorch. In Section 4.2, the
representation from the 2nd convolutional layers is extracted. Furthermore, BN and
AdaAvgPool refer to batch normalization and adaptive average pooling, respectively.
The term “down” denotes downsampling by factor 2.

    RGB image x ∈ ℝ^{128×128×3}
    3x Conv, BN, ReLU     → ℝ^{61×61×64}
    MaxPool               → ℝ^{30×30×64}
    3x Conv, BN, ReLU     → ℝ^{13×13×256}
    5x Block35            → ℝ^{13×13×256}
    Mixed down            → ℝ^{6×6×896}
    10x Block17           → ℝ^{6×6×896}
    Mixed down            → ℝ^{2×2×1792}
    5x Block8             → ℝ^{2×2×1792}
    AdaAvgPool            → ℝ^{1×1×1792}
    Dropout, Linear, BN   → ℝ^{512}
    identity embedding    → ℝ^{512}

ResNet-101: See https://pytorch.org/docs/stable/torchvision/models.html
for details on other variants of ResNet. Bottleneck, FC, and Norm refer to bottleneck
residual unit, fully connected layer, and batch normalization, respectively. The term
“down” denotes downsampling by factor 2.

    RGB image x ∈ ℝ^{128×128×3}
    Conv down             → ℝ^{64×64×64}
    Norm, ReLU, MaxPool   → ℝ^{32×32×64}
    3x BottleNeck         → ℝ^{32×32×256}
    4x BottleNeck down    → ℝ^{16×16×512}
    23x BottleNeck down   → ℝ^{8×8×1024}
    3x BottleNeck down    → ℝ^{4×4×2048}
    AvgPool, FC
    output                → ℝ^{1000}


Throughout our experiments, we interpret four different models: SqueezeNet, 
AlexNet, ResNet, and FaceNet. Summaries of each model’s architecture are
provided in Tables 6 and 7. Implementations and pretrained weights of these
models are taken from: 


SqueezeNet (1.1): https://pytorch.org/docs/stable/_modules/torchvision/models/squeezenet;
ResNet: https://pytorch.org/docs/stable/_modules/torchvision/models/resnet.html;
AlexNet: https://pytorch.org/docs/stable/_modules/torchvision/models/alexnet.html;
FaceNet: https://github.com/timesler/facenet-pytorch.


Explained variance: To quantify the amount of invariances and semantic concepts, 
we use the fraction of the total variance explained by invariances (Fig. 8) and the 
fraction of the variance of a semantic concept explained by the model representation 
(Fig. 7). 
Table 7 High-level architectures of SqueezeNet and AlexNet, depicted as pytorch-modules.
See Table 6 for further details

SqueezeNet: We extract the penultimate Fire block for interpretation in Section 4.2.
AdaAvgPool and Fire refer to adaptive average pooling and the Fire module introduced
in [IHM+16].

    RGB image x ∈ ℝ^{128×128×3}
    Conv, ReLU, MaxPool     → ℝ^{31×31×64}
    2x Fire                 → ℝ^{31×31×128}
    MaxPool                 → ℝ^{15×15×128}
    2x Fire                 → ℝ^{15×15×256}
    MaxPool                 → ℝ^{7×7×256}
    4x Fire                 → ℝ^{7×7×512}
    Dropout, Conv, ReLU     → ℝ^{7×7×1000}
    AdaAvgPool              → ℝ^{1×1×1000}
    output                  → ℝ^{1000}

AlexNet: The first convolution uses kernel size 11. AdaAvgPool, Flatten, and Linear
refer to adaptive average pooling, reshaping the three-dimensional representation of
the input to a vector, and fully-connected layer, respectively.

    RGB image x ∈ ℝ^{128×128×3}
    Conv, ReLU, MaxPool     → ℝ^{15×15×64}
    Conv, ReLU, MaxPool     → ℝ^{7×7×192}
    Conv, ReLU              → ℝ^{7×7×384}
    2x Conv, ReLU           → ℝ^{7×7×256}
    MaxPool                 → ℝ^{3×3×256}
    AdaAvgPool, Flatten     → ℝ^{9216}
    Dropout, Linear, ReLU   → ℝ^{4096}
    Dropout, Linear, ReLU   → ℝ^{4096}
    Linear                  → ℝ^{1000}


Using the INN $t$, we can consider $\bar{z} = t^{-1}(v|z)$ as a function of $v$ and $z$. The total
variance of $\bar{z}$ is then obtained by sampling $v$, via its prior which is a standard normal
distribution, and $z$, via $z = \Phi(x)$ with $x \sim p_{\mathrm{valid}}(x)$ sampled from a validation set.
We compare this total variance to the average variance obtained when sampling $v$
for a given $z$ to obtain the fraction of the total variance explained by invariances:

$$
\frac{\mathbb{E}_{x' \sim p_{\mathrm{valid}}(x')}\left[\operatorname{Var}_{v \sim \mathcal{N}(v|0, I)}\; t^{-1}(v \mid \Phi(x'))\right]}
     {\operatorname{Var}_{x \sim p_{\mathrm{valid}}(x),\; v \sim \mathcal{N}(v|0, I)}\; t^{-1}(v \mid \Phi(x))}
\tag{20}
$$


In combination with the INN $e$, which transforms $\bar{z}$ to semantically meaningful
factors, we can analyze the semantic content of a model representation $z$. To analyze
how much of a semantic concept represented by factor $e_i$ is captured by $z$, we use $e$
to transform $\bar{z}$ into $e_i$ and measure its variance. To measure how much the semantic
concept is explained by $z$, we simply swap the roles of $z$ and $v$ in (20), to obtain

$$
\frac{\mathbb{E}_{v' \sim \mathcal{N}(v'|0, I)}\left[\operatorname{Var}_{x \sim p_{\mathrm{valid}}(x)}\; e\big(t^{-1}(v' \mid \Phi(x))\big)_i\right]}
     {\operatorname{Var}_{x \sim p_{\mathrm{valid}}(x),\; v \sim \mathcal{N}(v|0, I)}\; e\big(t^{-1}(v \mid \Phi(x))\big)_i}
\tag{21}
$$


Figure 8 reports (20) and its standard error when evaluated via 10k samples, and 
Fig. 7 reports (21) and its standard error when evaluated via 10k samples. 
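
As an illustration of how a ratio such as (20) can be estimated by sampling, the following minimal sketch may help. It is not the implementation used for Fig. 8: the function name, the choice to sum variances over dimensions, and the toy map `toy_t_inv` (a stand-in for a trained INN, assuming v has the same dimension as the output) are all assumptions made for this example.

```python
import numpy as np

def fraction_explained_by_invariances(t_inv, z_samples, n_v=64, seed=0):
    """Monte Carlo estimate of E_z[Var_v z_bar] / Var_{z,v} z_bar, cf. (20).

    t_inv(v, z) is a hypothetical callable mapping a batch v of shape
    (n_v, dim) and one representation z of shape (dim,) to z_bar (n_v, dim).
    """
    rng = np.random.default_rng(seed)
    z_samples = np.asarray(z_samples, dtype=float)
    n_z, dim = z_samples.shape
    # v is drawn from its standard normal prior, independently for every z.
    v = rng.standard_normal((n_z, n_v, dim))
    z_bar = np.stack([t_inv(v[i], z_samples[i]) for i in range(n_z)])
    # Numerator: variance over v for fixed z, summed over dimensions, averaged over z.
    within = z_bar.var(axis=1).sum(axis=-1).mean()
    # Denominator: variance over all (z, v) pairs, summed over dimensions.
    total = z_bar.reshape(n_z * n_v, -1).var(axis=0).sum()
    return within / total

# Toy stand-in for t^{-1} (NOT a trained INN): z and v contribute equal variance.
toy_t_inv = lambda v, z: z + v
z = np.random.default_rng(1).standard_normal((2000, 4))
ratio = fraction_explained_by_invariances(toy_t_inv, z)
print(round(ratio, 2))  # close to 0.5 for this toy map
```

In the toy case both sources of variation are unit-variance Gaussians, so roughly half of the total variance is attributed to v. A map that ignores v would give a ratio of zero.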


Conclusions 


Understanding a representation in terms of both its semantics and learned invariances 
is crucial for interpretation of deep networks. We presented an approach (i) to recover 
the invariances a model has learned and (ii) to translate the representation and its 
invariances onto an equally expressive yet semantically accessible encoding. Our 
diagnostic method is applicable in a plug-and-play fashion on top of existing deep 
models with no need to alter or retrain them. Since our translation onto semantic fac- 
tors is bijective, it loses no information and also allows for semantic modifications. 
Moreover, recovering invariances probabilistically guarantees that we can correctly 
visualize representations and sample them without leaving the underlying distribu- 
tion, which is a common cause for artifacts. Altogether, our approach constitutes a 
powerful, widely applicable diagnostic pipeline for explaining deep representations. 
Abstract Calibrated confidence estimates obtained from neural networks are crucial,
particularly for safety-critical applications such as autonomous driving or medical
image diagnosis. However, although the task of confidence calibration has been
investigated on classification problems, thorough investigations on object detection
and segmentation problems are still missing. Therefore, we focus on the investigation
of confidence calibration for object detection and segmentation models in
this chapter. We introduce the concept of multivariate confidence calibration, which is
an extension of well-known calibration methods to the tasks of object detection and
segmentation. This allows for an extended confidence calibration that is also aware
of additional features such as bounding box/pixel position and shape information.
Furthermore, we extend the expected calibration error (ECE) to measure miscalibration
of object detection and segmentation models. We examine several network
architectures on MS COCO as well as on Cityscapes and show that especially object
detection as well as instance segmentation models are intrinsically miscalibrated
given the introduced definition of calibration. Using our proposed calibration methods,
we have been able to improve calibration so that it also has a positive impact on
the quality of segmentation masks.
1 Introduction


Common neural networks for object detection must not only determine the position 
and class of an object but also denote their confidence about the correctness of 
each detection. This also holds for instance or semantic segmentation models that 
output a score for each pixel indicating the confidence of object mask membership. 
A reliable confidence estimate for detected objects is crucial, particularly for safety- 
critical applications such as autonomous driving, to reliably process detected objects. 
Another example is medical diagnosis, where, for example, the shape of a brain tumor 
within an MRI image is of special interest [MWT+20]. 

Each confidence estimate might be seen as a probability of correctness, reflect- 
ing the model’s uncertainty about a detection or the pixel mask. During inference, 
we expect the estimated confidence to match the observed precision for a predic- 
tion. For example, given 100 predictions with 80% confidence each, we expect 80 
predictions to be correctly predicted [GPSW17, KKSH20]. However, recent work 
has shown that the confidence estimates of either classification or detection models 
based on neural networks are miscalibrated, i.e., the confidence does not match the 
observed accuracy in classification [GPSW17] or the observed precision in object 
detection [KKSH20]. While confidence calibration within the scope of classification 
has been extensively investigated [NCH15, NC16, GPSW17, KSFF17, KPNK+19], 
we recently defined the term of calibration for object detection and proposed methods 
to measure and resolve miscalibration [KKSH20, SKR+21]. In this context, we mea- 
sured miscalibration w.r.t. the position and scale of detected objects by also including 
the regression branch of an object detector into a calibration mapping. We have been 
able to show that modern object detection models also tend to be miscalibrated. 
On the one hand, this can be mitigated using standard calibration methods. On the 
other hand, we show that our proposed methods for position- and shape-dependent 
calibration [KKSH20] are able to further reduce miscalibration. 

In this chapter we review our methods for position- and shape-dependent confi- 
dence calibration and provide a definition for confidence calibration for the task 
of instance/semantic segmentation. To this end, we extend the definition of the 
expected calibration error (ECE) to measure miscalibration within the scope of 
instance/semantic segmentation. Furthermore, we adapt the extended calibration 
methods originally designed for object detection [KKSH20] to enable position- 
dependent confidence calibration for segmentation tasks as well. 

This chapter is structured as follows. In Sect. 2 we review the most important 
related works regarding confidence calibration. In Sect. 3 we provide the definitions 
of calibration for the tasks of object detection, instance segmentation, and semantic 
segmentation. Furthermore, in Sect. 4 we present the extended confidence calibration 
methods for object detection and segmentation. Extensive experimental evaluations 
on a variety of architectures and computer vision problems, including object detection 
and instance/semantic segmentation, are discussed in Sect. 5. Finally, we provide 
conclusions and discuss our most important findings. 
2 Related Works


In the past, most research focused on measuring and resolving miscalibration for 
classification tasks. In this scope, the expected calibration error (ECE) [NCH15] is 
commonly used in conjunction with the Brier score and negative log likelihood (NLL) 
loss to measure miscalibration. For calculating the ECE, all samples are grouped into 
equally sized bins by their confidence. Afterward, for each bin the accuracy is calcu- 
lated and used as an approximation of the accuracy of a single sample in its respective 
bin. Recently, several extensions such as classwise ECE [KPNK+19], marginal calibration
error [KLM19], or adaptive calibration error [NDZ+19] for the evaluation of
multi-class problems have also been proposed, where the ECE is evaluated for each 
class separately. In contrast to previous work, we extend the common ECE definition 
to the task of object detection [KKSH20] and instance/semantic segmentation. This 
extension allows for a position-dependent miscalibration evaluation so that it is pos- 
sible to quantify miscalibration separately for certain image regions. The definition 
is given in Sect. 3. 
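
The ECE computation described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes equal-width confidence bins on [0, 1], and the function and argument names are chosen for this example only.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (an assumption of this sketch)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per sample; a confidence of exactly 1.0 lands in the last bin.
    idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        in_bin = idx == m
        if in_bin.any():
            # Gap between observed accuracy and average confidence in bin m,
            # weighted by the fraction of samples that fall into the bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# 100 predictions with 80% confidence each.
conf = np.full(100, 0.8)
ece_calibrated = expected_calibration_error(conf, [1] * 80 + [0] * 20)
ece_overconfident = expected_calibration_error(conf, [1] * 60 + [0] * 40)
```

With 80 of the 100 predictions correct the error is (numerically) zero, whereas with only 60 correct predictions the gap of 0.2 between confidence and accuracy shows up directly in the ECE.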

Besides measuring miscalibration, it is also possible to resolve a potential mis- 
calibration by using calibration methods that are applied after inference. These post- 
hoc calibration methods can be divided into binning and scaling methods. Binning 
methods such as histogram binning [ZE01], isotonic regression [ZE02], Bayesian
binning into quantiles (BBQ) [NCH15], or ensemble of near-isotonic regression 
(ENIR) [NC16] group all samples into several bins by their confidence (similar to 
the ECE calculation) and perform a mapping from uncalibrated confidence esti- 
mates to calibrated ones. In contrast, scaling methods such as logistic calibration 
(Platt scaling) [Pla99], temperature scaling [GPSW17], beta calibration [KSFF17], 
or Dirichlet calibration [KPNK+19] scale the network logits before sigmoid/softmax 
activation to obtain calibrated confidences. The scaling parameters are commonly 
obtained by logistic regression. Other approaches comprise binwise temperature scal- 
ing [JJY+19] or scaling-binning calibrator [KLM19], combining both approaches to 
further improve calibration performance for classification tasks. 

Recently, we proposed an extension of common calibration methods to object 
detection by also including the bounding box regression branch of an object detector 
into a calibration mapping [KKSH20]. On the one hand, we extended the histogram 
binning [ZE01] to perform calibration using a multidimensional binning scheme. On 
the other hand, we also extended scaling methods to include position information 
into a calibration mapping. Both approaches are presented in Sect. 4 in more detail. 

Unlike classification, the task of instance or semantic segmentation calibration 
has not yet been addressed by many authors. In the work of [WLK+20], the authors 
perform online confidence calibration of the classification head of an instance seg- 
mentation model and show that this has a significant impact on the mask quality. The 
authors in [KG20] use a multi-task learning approach for semantic segmentation 
models to improve model calibration and out-of-distribution detection within the 
scope of medical image diagnosis. A related approach is proposed by [MWT+20], 
where the authors train multiple fully-connected networks as an ensemble to obtain 
well-calibrated semantic segmentation models. However, none of these methods
provide an explicit definition of calibration for segmentation models (instance or 
semantic). This problem is addressed by [DLXS20], where the authors explicitly 
define semantic segmentation calibration and propose a local temperature scaling for 
semantic image masks. This approach utilizes the well-known temperature scaling 
[GPSW17] and assigns a temperature for each mask pixel separately. Furthermore, 
they use a dedicated convolutional neural network (CNN) to infer the temperature 
parameters for each image. Our definition of segmentation calibration in Sect. 3
conforms to their definition. Our approach differs from their proposed image-based
temperature scaling as we use a single position-dependent calibration mapping to 
model the probability distributions for all images. 


3 Calibration Definition and Evaluation 


In this section, we review the notion of confidence calibration for classification tasks
[NCH15, GPSW17] and extend this definition to object detection, instance segmentation,
and semantic segmentation. Furthermore, we derive the detection expected
calibration error (D-ECE) to measure miscalibration.


3.1 Definitions of Calibration 


Classification: In a first step, we define the dataset $\mathcal{D}$ of size $|\mathcal{D}| = N$ with indices
$i \in \mathcal{I} = \{1, \ldots, N\}$, consisting of images $x \in \mathbb{I}^{H \times W \times C}$, $\mathbb{I} = [0, 1]$, with height $H$,
width $W$, and number of channels $C$. Each image has ground-truth information that
consists of the class information $Y \in \mathcal{Y} = \{1, \ldots, Y\}$. As an introduction into the
task of confidence calibration, we start with the definition of perfect calibration for
classification. A classification model $F_{\mathrm{cls}}$ outputs a label $\hat{Y}$ and a corresponding
confidence score $\hat{P}$ indicating its belief about the prediction’s correctness. In this
case, perfect calibration is defined by [GPSW17]

$$
\mathbb{P}(\hat{Y} = Y \mid \hat{P} = p) = p \quad \forall p \in [0, 1].
\tag{1}
$$


In other words, the accuracy $\mathbb{P}(\hat{Y} = Y \mid \hat{P} = p)$ for a certain confidence level $p$ should
match the estimated confidence. If we observe a deviation, a model is called miscalibrated.
In binary classification, we rather consider the relative frequency of $Y = 1$
instead of the accuracy as the calibration measure. We illustrate this difference using
the following example. Consider $N = 100$ images of a dataset $\mathcal{D}$ with binary ground-truth
labels $Y \in \{0, 1\}$, where 50 images are labeled as $Y = 0$ and 50 images as
$Y = 1$. Furthermore, consider a classification model $F_{\mathrm{cls}}$ with a sigmoidal output
in $[0, 1]$, indicating its confidence for $Y = 1$. In our example, this model is able to
always predict the correct ground-truth label with confidences $\hat{P} = 0$ and $\hat{P} = 1$ for
$Y = 0$ and $Y = 1$, respectively. Thus, the network is able to achieve an accuracy of
100% but with an average confidence of 50%. Therefore, we consider the relative
frequency for $Y = 1$ in a binary classification task as the calibration goal, which is also
50% in this scenario.
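
The numbers in this example can be reproduced with a few lines of code. The snippet below is only an illustration of the argument; the decision threshold of 0.5 used to turn the sigmoidal output into a label is an assumption of this sketch.

```python
import numpy as np

# 50 images with Y = 0 and 50 images with Y = 1.
labels = np.array([0] * 50 + [1] * 50)
# The model outputs confidence 0 for Y = 0 and confidence 1 for Y = 1.
confidences = labels.astype(float)
# Assumed decision rule: predict Y = 1 if the confidence is at least 0.5.
predictions = (confidences >= 0.5).astype(int)

accuracy = (predictions == labels).mean()   # 1.0
mean_confidence = confidences.mean()        # 0.5
positive_frequency = labels.mean()          # 0.5
```

Comparing the average confidence with the accuracy would flag this model as miscalibrated, although every single prediction is correct. Comparing it with the relative frequency of Y = 1 does not.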


Object detection: In the next step, we extend our dataset and model to the task of
object detection. In contrast to a classification dataset, an object detection dataset
consists of ground-truth annotations $Y \in \mathcal{Y} = \{1, \ldots, Y\}$ for each object within an
image as well as the ground-truth position and shape information $R \in \mathcal{R} = [0, 1]^A$
($A$ denotes the size of the normalized box encoding, comprising the center $x$ and $y$
positions $c_x$, $c_y$, as well as width $w$ and height $h$). An object detection model $F_{\mathrm{det}}$
further outputs a confidence score $\hat{P} \in [0, 1]$, a label $\hat{Y} \in \mathcal{Y}$, and the corresponding
position $\hat{R} \in \mathcal{R}$ for each detection in an image $x$. Extending the original formulation
for confidence calibration within classification tasks [GPSW17], perfect calibration
for object detection is defined by [KKSH20]

$$
\mathbb{P}(M = 1 \mid \hat{P} = p, \hat{Y} = y, \hat{R} = r) = p \quad
\forall p \in [0, 1],\; y \in \mathcal{Y},\; r \in \mathcal{R},
\tag{2}
$$

where $M = 1$ denotes a correctly classified prediction that matches a ground-truth
object with a certain intersection-over-union (IoU) score. Commonly, an object detection
model is calibrated by means of its precision, as the computation of the accuracy
is not possible without knowing all anchors of a model [KKSH20, SKR+21]. Thus,
$\mathbb{P}(M = 1)$ is a shorthand notation for $\mathbb{P}(\hat{Y} = Y, \hat{R} = R)$ that expresses the precision
for a dedicated IoU threshold.


Instance segmentation: At this point, we adapt this idea to define confidence calibration
for instance segmentation. Consider a dataset with $K$ annotated objects $k \in \mathcal{K}$ over
all images. For notational simplicity, we further use $j \in \mathcal{J}_k = \{1, \ldots, H_k \cdot W_k\}$ as the
index for pixel $j$ within a bounding box $R_k$ of object $k \in \mathcal{K}$ in the instance segmentation
dataset, where $H_k$ and $W_k$ denote the height and width of object $k$, respectively.
In addition to the annotations used for object detection, a ground-truth dataset for
instance segmentation also consists of pixel-wise mask labels denoted by
$Y_j \in \mathcal{Y}^* = \{0, 1\}$ for each pixel $j$ in the bounding box $R_k$ of object $k$. Note that we
introduce the star superscript (*) here to distinguish between the bounding box
label/prediction encoding and the instance segmentation label/prediction encoding.
An instance segmentation model $F_{\mathrm{ins}}$ predicts the membership $\hat{Y}_j \in \mathcal{Y}^*$ for each pixel $j$
in the predicted bounding box $\hat{R}_k$ to the object mask with a certain confidence
$\hat{P}_j \in [0, 1]$. We further denote $R_j \in \mathcal{R}^* = [0, 1]^{A^*}$ as the position of pixel $j$ within the
bounding box $\hat{R}_k$, where $A^*$ denotes the size of the used position encoding of a pixel
within its bounding box. In contrast to object detection, it is possible to treat the
confidence scores of each pixel within an instance segmentation mask as a binary
classification problem. In this case, the confidence $\hat{P}_j$ can be interpreted as the
probability of a pixel belonging to the object mask. Therefore, the aim of confidence
calibration for instance segmentation is that the pixel confidence should match the
relative frequency that a pixel is part of the object mask. Analogous to the task of
object detection, we further include a position dependency into the definition for
instance segmentation and obtain

$$
\mathbb{P}(\hat{Y}_j = Y_j \mid \hat{Y} = y, \hat{P}_j = p, R_j = r) = p \quad
\forall p \in [0, 1],\; y \in \mathcal{Y},\; r \in \mathcal{R}^*,\; j \in \mathcal{J}_k,\; k \in \mathcal{K}.
\tag{3}
$$

The former term can be interpreted as the probability that the prediction $\hat{Y}_j$ for a
pixel with index $j$ within an object $k$ matches the ground-truth annotation $Y_j$ given
a certain confidence $p$, a certain pixel position $r$ within the bounding box, as well as
a certain object category $y$ that is predicted by the bounding box head of the instance
segmentation model.


Semantic Segmentation: Compared to object detection or instance segmentation, a
semantic segmentation dataset does not hold ground-truth information for individual
objects but rather consists of pixel-wise class annotations $Y_j \in \mathcal{Y} = \{1, \ldots, Y\}$,
$j \in \mathcal{J} = \{1, \ldots, H \cdot W\}$. A semantic segmentation model $F_{\mathrm{sem}}$ outputs pixel-wise
labels $\hat{Y}_j$ and probabilities $\hat{P}_j$ with relative position $R_j$ within an image. Therefore,
we can define perfect calibration for semantic segmentation as

$$
\mathbb{P}(\hat{Y}_j = Y_j \mid \hat{P}_j = p, R_j = r) = p \quad
\forall p \in [0, 1],\; r \in \mathcal{R}^*,\; j \in \mathcal{J},
\tag{4}
$$

which is related to the calibration definition of classification [GPSW17]. In addition,
the confidence score of each pixel must not only reflect its accuracy given a certain
confidence level but also at a certain pixel position.


3.2 Measuring Miscalibration 


We can measure the miscalibration of an object detection model as the expected
deviation between confidence and observed precision, which is also known as the
detection expected calibration error (D-ECE) [KKSH20]. Let further $s = (p, y, r)$
denote a single detection with confidence $p$, class $y$, and bounding box $r$ so that $s \in \mathcal{S}$,
where $\mathcal{S}$ is the aggregated set of the confidence space $[0, 1]$, the set of ground-truth
labels $\mathcal{Y}$, and the set of possible bounding box positions $\mathcal{R}$. Since the conditioning
variables $\hat{P}$ and $\hat{R}$ are continuous random variables, we need to approximate the
D-ECE using the Riemann-Stieltjes integral [GPSW17]

$$
\mathbb{E}_{\hat{P}, \hat{Y}, \hat{R}}\Big[\big|\mathbb{P}(M = 1 \mid \hat{P} = p, \hat{Y} = y, \hat{R} = r) - p\big|\Big]
\tag{5}
$$
$$
= \int_{\mathcal{S}} \big|\mathbb{P}(M = 1 \mid \hat{P} = p, \hat{Y} = y, \hat{R} = r) - p\big| \; dF_{\hat{P}, \hat{Y}, \hat{R}}(s)
\tag{6}
$$

with $F_{\hat{P}, \hat{Y}, \hat{R}}(s)$ as the joint cumulative distribution of $\hat{P}$, $\hat{Y}$, and $\hat{R}$. Let
$a = (a_p, a_y, a_{r_1}, \ldots, a_{r_A})^\top \in \mathcal{S}$ and $b = (b_p, b_y, b_{r_1}, \ldots, b_{r_A})^\top \in \mathcal{S}$ denote interval
boundaries for the confidence $a_p, b_p \in [0, 1]$, the estimated labels $a_y, b_y \in \mathcal{Y}$ and
each quantity of the bounding box encoding $a_{r_1}, b_{r_1}, \ldots, a_{r_A}, b_{r_A} \in [0, 1]$ defined on
the joint cumulative $F_{\hat{P}, \hat{Y}, \hat{R}}(s)$ so that¹ $F_{\hat{P}, \hat{Y}, \hat{R}}(b) - F_{\hat{P}, \hat{Y}, \hat{R}}(a) = \mathbb{P}(\hat{P} \in [a_p, b_p],
\hat{Y} \in [a_y, b_y], \hat{R}_1 \in [a_{r_1}, b_{r_1}], \ldots, \hat{R}_A \in [a_{r_A}, b_{r_A}])$. The integral in (6) is now
approximated by


$$
\sum_{m \in \mathcal{B}} \mathbb{P}(s_m) \cdot \big|\mathbb{P}(M = 1 \mid \hat{P} = p_m, \hat{Y} = y_m, \hat{R} = r_m) - p_m\big|,
\tag{7}
$$

with $B$ as the number of equidistant bins used for integral approximation so that
$\mathcal{B} = \{1, \ldots, B\}$, and $p_m \in [0, 1]$, $y_m \in \mathcal{Y}$ and $r_m \in \mathcal{R}$ being the respective bin entities for
average confidence, the current label, and the average (unpacked) bounding box
scores within bin $m \in \mathcal{B}$, respectively. Let $N_m$ denote the number of samples within
a single bin $m$. For large datasets $|\mathcal{D}| = N$, the probability $\mathbb{P}(M = 1 \mid \hat{P} = p_m, \hat{Y} =
y_m, \hat{R} = r_m)$ is approximated by the average precision within a single bin $m$, whereas
$p_m$ is approximated by the average confidence, so that the D-ECE can finally be
computed by

$$
\approx \sum_{m \in \mathcal{B}} \frac{N_m}{N} \, \big|\mathrm{prec}(m) - \mathrm{conf}(m)\big|,
\tag{8}
$$

where prec(m) and conf(m) denote the precision and average confidence within bin
m, respectively.
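
To make the binned estimate concrete, the following sketch computes a D-ECE over confidence and additional box features. It is a simplified illustration under stated assumptions, not the evaluation code behind Sect. 5: all features are assumed to be normalized to [0, 1], the confidence is assumed to sit in column 0, the class dimension is left out (the function could be called once per class), and the names `detection_ece`, `features`, and `matched` are hypothetical.

```python
import numpy as np

def detection_ece(features, matched, bins):
    """Binned D-ECE sketch. features: (N, Q) in [0, 1], column 0 = confidence.
    matched: (N,) with 1 for detections that match a ground-truth object.
    bins: number of equidistant bins per feature dimension."""
    features = np.asarray(features, dtype=float)
    matched = np.asarray(matched, dtype=float)
    n, q = features.shape
    # Per-dimension bin index; a value of exactly 1.0 goes into the last bin.
    idx = [np.minimum((features[:, d] * bins[d]).astype(int), bins[d] - 1)
           for d in range(q)]
    flat = np.ravel_multi_index(tuple(idx), tuple(bins))
    d_ece = 0.0
    for m in np.unique(flat):          # only non-empty bins contribute
        in_bin = flat == m
        prec = matched[in_bin].mean()          # precision within the bin
        conf = features[in_bin, 0].mean()      # average confidence within the bin
        d_ece += in_bin.sum() / n * abs(prec - conf)
    return d_ece

# Eight detections with confidence 0.75; precision is 1.0 in the left image
# half (c_x = 0.25) and 0.5 in the right image half (c_x = 0.75).
features = np.array([[0.75, 0.25]] * 4 + [[0.75, 0.75]] * 4)
matched = np.array([1, 1, 1, 1, 1, 1, 0, 0])
global_error = detection_ece(features, matched, bins=(4, 1))    # 0.0
position_error = detection_ece(features, matched, bins=(4, 2))  # 0.25
```

The toy data are globally calibrated (overall precision 0.75 at confidence 0.75), so binning by confidence alone reports no error. Adding the position as a second binning dimension reveals a miscalibration of 0.25.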
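Similarly, we can extend the D-ECE to instance segmentation by

$$
\mathbb{E}_{\hat{Y}, \hat{P}_j, R_j}\Big[\big|\mathbb{P}(\hat{Y}_j = Y_j \mid \hat{Y} = y, \hat{P}_j = p, R_j = r) - p\big|\Big]
\tag{9}
$$
$$
\approx \sum_{m \in \mathcal{B}} \frac{N_m}{N} \, \big|\mathrm{freq}(m) - \mathrm{conf}(m)\big|,
\tag{10}
$$

with a binning scheme over all pixels and freq(m) as the average frequency within
each bin. In this case, each pixel is treated as a separate prediction and binned by
its confidence, label class, and relative position. In the same way, the D-ECE for
semantic segmentation is approximated by

$$
\mathbb{E}_{\hat{P}_j, R_j}\Big[\big|\mathbb{P}(\hat{Y}_j = Y_j \mid \hat{P}_j = p, R_j = r) - p\big|\Big]
\tag{11}
$$
$$
\approx \sum_{m \in \mathcal{B}} \frac{N_m}{N} \, \big|\mathrm{acc}(m) - \mathrm{conf}(m)\big|,
\tag{12}
$$

with accuracy acc(m) within each bin m.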


¹ For discrete random variables (e.g., $\hat{Y}$), we use a continuous density function of the form
$p_{\hat{Y}}(y) = \sum_{y^* \in \mathcal{Y}} \mathbb{P}(y^*)\, \delta(y - y^*)$ with $\delta(x)$ as the Dirac delta function.
For the calibration evaluation of object detection models, we can use the relative
position $c_x$, $c_y$ and the shape $h$, $w$ of the bounding boxes [KKSH20]. For segmentation,
we consider the relative position $x$, $y$ of each pixel within a bounding box
(instance segmentation) or within the image (semantic segmentation), as well as
its distance $d$ to the nearest segment boundary, as we expect a higher uncertainty in
the peripheral areas of a segmentation mask. We use these definitions to evaluate
different models in Sect. 5.


4 Position-Dependent Confidence Calibration 


For post-hoc calibration, we distinguish between binning and scaling methods. 
According to the approximation in (7), binning methods such as histogram binning
[ZE01] divide all samples into several bins by their confidence and measure
the average accuracy/precision within each bin. In contrast, scaling methods rescale 
the logits of a neural network before sigmoid/softmax activation to calibrate a net- 
work’s output. In this section, we extend standard calibration methods so that they are 
capable of dealing with additional information such as position and/or shape. These 
extended methods can be used for confidence calibration of object detection, instance 
segmentation, and semantic segmentation tasks. We illustrate the difference between 
standard calibration and position-dependent calibration using an artificially created 
dataset in Fig. 1. This dataset consists of points with a confidence score and binary
ground-truth information in {0, 1}, so that we are able to compute the frequency 
of the points across the whole image. We observe that standard calibration only 
shifts the average confidence to fit the average precision/accuracy. This leads to an 


[Figure 1 panels: (a) Uncalibrated, (b) After standard calibration, (c) After position-dependent
calibration. Color scale from 0% to 25%; horizontal axis: $c_x$ position.]

Fig. 1 a Consider an artificial dataset with low calibration error in the center and increasing 
miscalibration toward the image boundaries. b A common logistic calibration model only shifts the 
average confidence to match the accuracy. This leads to a better global ECE score but also worsens
the calibration in the center of the image. c In contrast, position-dependent calibration is capable
of capturing possible dependencies and leads to an overall improvement of calibration (example taken from
previous work [KKSH20]) 
increased calibration error in the center of the image. In contrast, position-dependent
calibration as defined in this section is able to capture correlations between position 
information and calibration error and reduces the D-ECE across the whole image. 


4.1 Histogram Binning 


Given a binning scheme with $B$ different bins so that $\mathcal{B} = \{1, \ldots, B\}$ with the
according bin boundaries $0 = a_1 < a_2 < \cdots < a_{B+1} = 1$ and the corresponding calibrated
estimate $\theta_m$ within each bin, the objective for histogram binning to estimate
$\Theta = \{\theta_m \mid m \in \mathcal{B}\}$ is the minimization

$$
\min_{\Theta} \sum_{i \in \mathcal{I}} \sum_{m \in \mathcal{B}} \mathbb{1}(a_m \le \hat{p}_i < a_{m+1}) \cdot (\theta_m - y_i)^2,
\tag{13}
$$


with $\mathbb{1}(\cdot)$ as the indicator function, yielding a 1 if its argument is true and a 0 if it is
false [ZE01, GPSW17]. This objective converges to the fraction of positive samples
within each bin under consideration. However, within the scope of object detection
or segmentation, we also want to measure the accuracy/precision w.r.t. position and
shape information. As an extension to the standard histogram binning [ZE01], we
therefore propose a multidimensional binning scheme that divides all samples into
several bins by their confidence and by all additional information such as position and
shape [KKSH20]. We further denote $\hat{s} = (\hat{p}, \hat{r})$ of size $Q = A + 1$ as the input vector
to a calibration function consisting of the confidence and the bounding box encoding,
so that $\mathcal{Q} = \{1, \ldots, Q\}$. In a multidimensional histogram binning, we indicate
the number of bins as $B = (B_1, \ldots, B_Q)$ so that $\mathcal{B}^* = \{\mathcal{B}_q = \{1, \ldots, B_q\} \mid q \in \mathcal{Q}\}$
with bin boundaries $\{0 = a_{1,q} < a_{2,q} < \cdots < a_{B_q+1,q} = 1 \mid q \in \mathcal{Q}\}$. For each bin
combination $m_1 \in \mathcal{B}_1, \ldots, m_Q \in \mathcal{B}_Q$, we have a dedicated calibration parameter
$\theta_{m_1, \ldots, m_Q}$ so that $\Theta^* = \{\theta_{m_1, \ldots, m_Q} \mid m_1 \in \mathcal{B}_1, \ldots, m_Q \in \mathcal{B}_Q\}$. This results in an
objective function given by

$$
\min_{\Theta^*} \sum_{i \in \mathcal{I}} J(\hat{s}_i),
\tag{14}
$$

where

$$
J(\hat{s}_i) =
\begin{cases}
(\theta_{m_1, \ldots, m_Q} - y_i)^2, & \text{if } a_{m_q, q} \le \hat{s}_{i,q} < a_{m_q+1, q} \;\; \forall q \in \mathcal{Q},\; m_q \in \mathcal{B}_q, \\
0, & \text{otherwise,}
\end{cases}
\tag{15}
$$

which again converges to the fraction of positive samples within each bin. The term
$J(\hat{s}_i)$ simply denotes that the loss is only applied if a sample $\hat{s}_i$ falls in a certain bin
combination. A drawback of using this method is that the number of bins is given by
$\prod_{q \in \mathcal{Q}} B_q$, which grows exponentially as the number of dimensions $Q$ grows.
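
A multidimensional histogram binning along these lines can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the class name, the equidistant bin boundaries, and the fallback to the overall positive rate for empty bins are assumptions of this example, and all inputs are assumed to be normalized to [0, 1].

```python
import numpy as np

class MultiDimHistogramBinning:
    """Sketch: one calibration parameter per bin combination (m_1, ..., m_Q)."""

    def __init__(self, bins):
        self.bins = tuple(bins)  # number of equidistant bins per dimension

    def _bin_index(self, s):
        s = np.asarray(s, dtype=float)
        # Bin index per dimension; a value of exactly 1.0 goes into the last bin.
        return tuple(np.minimum((s[:, q] * b).astype(int), b - 1)
                     for q, b in enumerate(self.bins))

    def fit(self, s, y):
        y = np.asarray(y, dtype=float)
        idx = self._bin_index(s)
        hits = np.zeros(self.bins)
        count = np.zeros(self.bins)
        np.add.at(hits, idx, y)      # number of positive samples per bin
        np.add.at(count, idx, 1.0)   # number of samples per bin
        # The squared-error objective is minimized by the fraction of positives.
        # Empty bins fall back to the overall positive rate (an assumption).
        self.theta = np.where(count > 0, hits / np.maximum(count, 1.0), y.mean())
        return self

    def transform(self, s):
        return self.theta[self._bin_index(s)]

# Input vectors (confidence, c_x) and binary match labels.
s = np.array([[0.75, 0.25]] * 4 + [[0.75, 0.75]] * 4)
y = np.array([1, 1, 1, 1, 1, 1, 0, 0])
calibrator = MultiDimHistogramBinning(bins=(4, 2)).fit(s, y)
calibrated = calibrator.transform(s)
```

With 4 confidence bins and 2 position bins the parameter array already holds 8 entries. Each additional feature dimension multiplies this number, which is the exponential growth noted above and the reason why many bins stay empty on small datasets.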
4.2 Scaling Methods


As opposed to binning methods like histogram binning, scaling methods perform a 
rescaling of the logits before sigmoid/softmax activation to obtain calibrated confi- 
dences. We can distinguish between logistic calibration (Platt scaling) [Pla99] and 
beta calibration [KSFF17]. 


Logistic calibration: According to the well-known logistic calibration (Platt scaling)
[Pla99], the calibration parameters are commonly obtained using logistic
regression. For a binary logistic regression model, we assume normally distributed
scores for the positive and negative class, so that
$p(\hat{p} \mid +) \sim \mathcal{N}(\hat{p}; \mu_+, \sigma^2)$ and
$p(\hat{p} \mid -) \sim \mathcal{N}(\hat{p}; \mu_-, \sigma^2)$. The mean values for the two classes are given by
$\mu_+$, $\mu_-$ and the variance $\sigma^2$ is equal for both classes. First, we follow the derivation of
logistic calibration introduced by [Pla99, KSFF17] using the likelihood ratio


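\begin{align}
LR(\hat{p}) = \frac{p(\hat{p} \mid +)}{p(\hat{p} \mid -)}
&= \exp\!\left( \frac{1}{2\sigma^2} \left[ -(\hat{p} - \mu_+)^2 + (\hat{p} - \mu_-)^2 \right] \right) \tag{16} \\
&= \exp\!\left( \frac{1}{2\sigma^2} \left[ 2\hat{p}\,(\mu_+ - \mu_-) - (\mu_+^2 - \mu_-^2) \right] \right) \tag{17} \\
&= \exp\!\left( \frac{\mu_+ - \mu_-}{\sigma^2} \left[ \hat{p} - \frac{\mu_+ + \mu_-}{2} \right] \right) \tag{18} \\
&= \exp(\gamma (\hat{p} - \eta)), \tag{19}
\end{align}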


where $\gamma = \frac{1}{\sigma^2}(\mu_+ - \mu_-)$ and $\eta = \frac{1}{2}(\mu_+ + \mu_-)$. Assuming a uniform prior over the
positive and negative classes, the likelihood ratio equals the posterior odds. Hence,
a calibrated probability is derived by


\[
P(+ \mid \hat{p}) = \frac{1}{1 + LR(\hat{p})^{-1}} = \frac{1}{1 + \exp(-\gamma (\hat{p} - \eta))}, \tag{20}
\]


which recovers the logistic function. 
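
The derivation can be checked numerically. The following sketch assumes two Gaussian
class-conditional score distributions with equal variance and a uniform class prior, takes
the slope as the mean difference divided by the variance and the offset as the midpoint of
the two means, and compares the resulting logistic function with the posterior obtained
directly from Bayes' rule. The parameter values and function names are illustrative only.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def platt_parameters(mu_pos, mu_neg, sigma):
    """Slope gamma and offset eta of the logistic function implied by the two Gaussians."""
    gamma = (mu_pos - mu_neg) / sigma ** 2
    eta = 0.5 * (mu_pos + mu_neg)
    return gamma, eta


def calibrated_from_logistic(p, gamma, eta):
    """Calibrated confidence via the logistic function."""
    return 1.0 / (1.0 + math.exp(-gamma * (p - eta)))


def calibrated_from_bayes(p, mu_pos, mu_neg, sigma):
    """Posterior of the positive class under a uniform class prior."""
    pos = gaussian_pdf(p, mu_pos, sigma)
    neg = gaussian_pdf(p, mu_neg, sigma)
    return pos / (pos + neg)


mu_pos, mu_neg, sigma = 0.8, 0.4, 0.2  # illustrative values, not fitted to any data
gamma, eta = platt_parameters(mu_pos, mu_neg, sigma)
for p in (0.1, 0.5, 0.6, 0.9):
    a = calibrated_from_logistic(p, gamma, eta)
    b = calibrated_from_bayes(p, mu_pos, mu_neg, sigma)
    print(f"p={p:.1f}  logistic={a:.4f}  bayes={b:.4f}")
```

Both columns agree up to floating-point error, and a score exactly at the midpoint of the two
class means is mapped to a calibrated confidence of 0.5.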

Recently, [KKSH20] used this formulation to derive a position-dependent con- 
fidence calibration by using multivariate Gaussians for the positive and negative 
classes. Introducing the concept of multivariate confidence calibration [KKSH20], 
we can derive a likelihood ratio for position-dependent logistic calibration by 


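\[
LR_{LC}(\hat{s}) = \exp\!\left( \frac{1}{2} \left[ (\hat{s}_-^\top \Sigma_-^{-1} \hat{s}_-) - (\hat{s}_+^\top \Sigma_+^{-1} \hat{s}_+) \right] + c \right), \quad c = \log \frac{|\Sigma_-|}{|\Sigma_+|}, \tag{21}
\]

where $\hat{s}_+ = \hat{s} - \mu_+$ and $\hat{s}_- = \hat{s} - \mu_-$ using $\mu_+, \mu_- \in \mathbb{R}^Q$ as the mean vectors and
$\Sigma_+, \Sigma_- \in \mathbb{R}^{Q \times Q}$ as the covariance matrices for the positive and negative classes,
respectively.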
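
To illustrate how a position-dependent likelihood ratio changes the calibrated confidence,
the following sketch evaluates the ratio of two bivariate Gaussian densities for a feature
vector consisting of the confidence and one relative position coordinate. It is a simplified
example under stated assumptions: only two features are used, the class parameters are
invented for illustration, and the constant offset is computed directly from the
normalization of the two densities, whereas a fitted calibration model would treat it as one
more parameter.

```python
import math


def mahalanobis_and_det(s, mu, cov):
    """Squared Mahalanobis distance of a 2-D vector and the determinant of its 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = s[0] - mu[0], s[1] - mu[1]
    # Quadratic form with the explicit inverse of a 2x2 matrix.
    dist_sq = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return dist_sq, det


def log_likelihood_ratio(s, mu_pos, cov_pos, mu_neg, cov_neg):
    """log p(s | +) - log p(s | -) for two bivariate Gaussians (the 2*pi terms cancel)."""
    d_pos, det_pos = mahalanobis_and_det(s, mu_pos, cov_pos)
    d_neg, det_neg = mahalanobis_and_det(s, mu_neg, cov_neg)
    return 0.5 * (d_neg - d_pos) + 0.5 * math.log(det_neg / det_pos)


def calibrated_confidence(s, mu_pos, cov_pos, mu_neg, cov_neg):
    """Posterior of the positive class under a uniform class prior."""
    log_lr = log_likelihood_ratio(s, mu_pos, cov_pos, mu_neg, cov_neg)
    return 1.0 / (1.0 + math.exp(-log_lr))


# Hypothetical class statistics for (confidence, relative c_x):
# matched detections cluster around the image center, unmatched ones spread out more.
mu_pos, cov_pos = (0.8, 0.5), ((0.02, 0.0), (0.0, 0.03))
mu_neg, cov_neg = (0.6, 0.5), ((0.03, 0.0), (0.0, 0.10))

center = calibrated_confidence((0.8, 0.50), mu_pos, cov_pos, mu_neg, cov_neg)
border = calibrated_confidence((0.8, 0.05), mu_pos, cov_pos, mu_neg, cov_neg)
print(f"center: {center:.3f}  border: {border:.3f}")
```

With these invented statistics, the same raw confidence of 0.8 receives a lower calibrated
confidence near the image border than in the image center, which is the kind of effect a
confidence-only mapping cannot express.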

Beta calibration: Similarly, we can also extend the beta calibration method [KSFF17]
to a multivariate calibration scheme. However, we need a special form of a multivari- 
ate beta distribution as the Dirichlet distribution is only defined over a Q-simplex 
and thus is not suitable for this kind of calibration. Therefore, we use a multivariate 
beta distribution proposed by [LN82] which is defined as 


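\[
p(\hat{s} \mid \alpha, \beta) = \frac{1}{B(\alpha)} \prod_{q \in \mathcal{Q}} \left[ \lambda_q^{\alpha_q} (\hat{s}_q^*)^{\alpha_q - 1} \left( \frac{\hat{s}_q^*}{\hat{s}_q} \right)^{2} \right] \left[ 1 + \sum_{q \in \mathcal{Q}} \lambda_q \hat{s}_q^* \right]^{-\sum_{q \in \mathcal{Q}_0} \alpha_q}, \tag{22}
\]

with $\mathcal{Q}_0 = \{0, \ldots, Q\}$ and the shape parameters $\alpha = (\alpha_0, \ldots, \alpha_Q)^\top$,
$\beta = (\beta_0, \ldots, \beta_Q)^\top$ that are restricted to $\alpha_q, \beta_q > 0$. Furthermore, we denote
$\lambda_q = \frac{\beta_q}{\beta_0}$ and $\hat{s}_q^* = \frac{\hat{s}_q}{1 - \hat{s}_q}$. In this context, $B(\alpha)$ denotes the multivariate beta function. However,
this kind of beta distribution is only able to capture positive correlations [LN82].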
Nevertheless, it is possible to derive a likelihood ratio given by 


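\[
LR_{BC}(\hat{s}) = \exp\!\Bigg( \sum_{q \in \mathcal{Q}} \Big[ \alpha_q^+ \log(\lambda_q^+) - \alpha_q^- \log(\lambda_q^-) + (\alpha_q^+ - \alpha_q^-) \log(\hat{s}_q^*) \Big]
+ \sum_{q \in \mathcal{Q}_0} \Big[ \alpha_q^- \log\Big( 1 + \sum_{j \in \mathcal{Q}} \lambda_j^- \hat{s}_j^* \Big) - \alpha_q^+ \log\Big( 1 + \sum_{j \in \mathcal{Q}} \lambda_j^+ \hat{s}_j^* \Big) \Big] + c \Bigg), \tag{23}
\]

with $\alpha^+$, $\alpha^-$ and $\lambda^+$, $\lambda^-$ as the shape parameters for the multivariate beta distribution
in (22) for the positive and negative class, respectively, so that
$c = \log\left( \frac{B(\alpha^-)}{B(\alpha^+)} \right)$.
We investigate the effect of position-dependent calibration in the next section using
these methods.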
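
As a point of reference for the multivariate form, the following sketch covers only the
univariate beta calibration of [KSFF17]: if the scores of both classes follow beta
distributions, the likelihood ratio has the form exp(c) * s^a / (1 - s)^b, and the calibrated
confidence follows from it under a uniform class prior. The shape parameters and function
names below are illustrative, and the sketch does not implement the multivariate
distribution in (22).

```python
import math


def log_beta_fn(a, b):
    """Logarithm of the (univariate) beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_pdf(s, a, b):
    """Density of a beta distribution for 0 < s < 1."""
    return math.exp((a - 1) * math.log(s) + (b - 1) * math.log(1 - s) - log_beta_fn(a, b))


def beta_calibration_parameters(alpha_pos, beta_pos, alpha_neg, beta_neg):
    """Parameters a, b, c of the beta calibration map implied by two beta distributions."""
    a = alpha_pos - alpha_neg
    b = beta_neg - beta_pos
    c = log_beta_fn(alpha_neg, beta_neg) - log_beta_fn(alpha_pos, beta_pos)
    return a, b, c


def beta_calibrate(s, a, b, c):
    """Calibrated confidence 1 / (1 + 1 / LR) with LR = exp(c) * s**a / (1 - s)**b."""
    log_lr = c + a * math.log(s) - b * math.log(1 - s)
    return 1.0 / (1.0 + math.exp(-log_lr))


alpha_pos, beta_pos = 5.0, 2.0  # illustrative shape parameters of the positive class
alpha_neg, beta_neg = 2.0, 3.0  # illustrative shape parameters of the negative class
a, b, c = beta_calibration_parameters(alpha_pos, beta_pos, alpha_neg, beta_neg)

for s in (0.2, 0.5, 0.8):
    pos, neg = beta_pdf(s, alpha_pos, beta_pos), beta_pdf(s, alpha_neg, beta_neg)
    print(f"s={s:.1f}  beta calibration={beta_calibrate(s, a, b, c):.4f}  bayes={pos / (pos + neg):.4f}")
```

The closed-form map and the posterior computed from the two densities agree, which mirrors
the step from the class-conditional distributions to the likelihood ratio in the text.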


5 Experimental Evaluation and Discussion 


In this section, we evaluate our proposed calibration methods for the tasks of object 
detection, instance segmentation, and semantic segmentation using pretrained neu- 
ral networks. For calibration evaluation, we use the MS COCO validation dataset 
[LMB+14] consisting of 5,000 images with 80 different object classes for object 
detection and instance segmentation. For semantic segmentation, we use the panop- 
tic segmentation annotations consisting of 171 different object and stuff categories in 
total. Our investigations are limited to the validation dataset since the training set has 
already been used for network training and no ground-truth annotations are available 
for the respective test dataset. Thus, we use a 50%/50% random split and use the 
first set for calibration training, while the second set is used for the evaluation of the 
calibration methods. Furthermore, we also utilize the Cityscapes validation dataset 
[COR+16] consisting of 500 images with 19 different classes that are used for model
training. The Münster and Lindau images are used for calibration training, whereas
the Frankfurt images are held back for calibration evaluation. 
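
The two data splits can be sketched as follows. This is a minimal illustration, assuming
that images are identified by IDs or by file names that start with the city name; the seed,
the function names, and the example file names are hypothetical.

```python
import random


def random_half_split(image_ids, seed=0):
    """Random 50%/50% split into a calibration-training and an evaluation set."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    half = len(ids) // 2
    return ids[:half], ids[half:]


def split_by_city(file_names, eval_cities=("frankfurt",)):
    """Hold back all images of the given cities for evaluation."""
    train = [f for f in file_names if f.split("_")[0] not in eval_cities]
    evaluation = [f for f in file_names if f.split("_")[0] in eval_cities]
    return train, evaluation


# Hypothetical identifiers for illustration only.
calib_ids, eval_ids = random_half_split(range(10), seed=0)
files = ["frankfurt_000000_000294", "lindau_000000_000019", "munster_000000_000019"]
calib_files, eval_files = split_by_city(files)
```

Splitting on the image level keeps all detections of one image in the same set, so no image
contributes to both calibration training and evaluation.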


5.1 Object Detection 


We evaluate our proposed calibration methods histogram binning (HB) (14), logistic
calibration (LC) (21), and beta calibration (BC) (23) for the task of object detection
using a pretrained Faster R-CNN X101-FPN [RHGS15, WKM+19] as well as a
pretrained RetinaNet [LGG+17, WKM+19] on the MS COCO validation set. For
Cityscapes, we use a pretrained Mask R-CNN R50-FPN [HGDG17, WKM+19],
where only the predicted bounding box information is used for calibration. For
calibration evaluation, we use the proposed D-ECE with the same subsets of features that
have been used for calibration training. Thus, we use a binning scheme of $B = 20$
using the confidence $\hat{p}$ only. In contrast, we use $B_q = 5$ for $Q = 5$ when all auxiliary
information is used. Each bin with fewer than 8 samples is neglected to increase D-ECE
robustness. We further measure the Brier score (BS) and the negative log likelihood
(NLL) of each model to evaluate its calibration properties. It is of special interest
to assess whether calibration has an influence on precision and recall. Thus, we also report
the area under the precision/recall curve (AUPRC) for each model. As opposed to
previous examinations [KKSH20], we evaluate calibration using all available classes
within each dataset. Each of these scores is measured for each class separately. In our
experiments, we report average scores that are weighted by the number of
samples for each class. We perform calibration using IoU thresholds of 0.50 and 0.75,
respectively. Furthermore, only bounding boxes with a score over 0.3 are used for
calibration to reduce the number of non-informative predictions. We give a qualitative
example in Fig. 2 that illustrates the effect of confidence calibration in the scope
of object detection.
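
The evaluation metrics can be sketched as follows. The D-ECE implementation is an
assumption: it takes the D-ECE as the sample-weighted mean absolute difference between
precision and average confidence per bin, uses equal-width bins, and leaves bins below the
sample threshold out of both the sum and the normalization. All function names and the toy
values are illustrative.

```python
import math
from collections import defaultdict


def d_ece(samples, matched, bins_per_dim, min_samples=8):
    """Detection ECE over equal-width bins; the first feature of a sample is the confidence."""
    conf_sum = defaultdict(float)
    hit_sum = defaultdict(int)
    count = defaultdict(int)
    for s, y in zip(samples, matched):
        key = tuple(min(int(v * b), b - 1) for v, b in zip(s, bins_per_dim))
        conf_sum[key] += s[0]
        hit_sum[key] += y
        count[key] += 1
    kept = [k for k in count if count[k] >= min_samples]  # drop sparsely populated bins
    total = sum(count[k] for k in kept)
    if total == 0:
        return float("nan")
    # Sum of n_k / total * |precision_k - confidence_k| over all kept bins.
    return sum(abs(hit_sum[k] - conf_sum[k]) for k in kept) / total


def brier_score(confidences, matched):
    """Mean squared difference between confidence and match label."""
    return sum((p - y) ** 2 for p, y in zip(confidences, matched)) / len(confidences)


def negative_log_likelihood(confidences, matched, eps=1e-12):
    """Mean binary negative log likelihood; confidences are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(confidences, matched):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(confidences)


def class_weighted_average(score_per_class, samples_per_class):
    """Average of class-wise scores, weighted by the number of samples per class."""
    total = sum(samples_per_class.values())
    return sum(score_per_class[c] * samples_per_class[c] for c in score_per_class) / total


# Toy example: ten detections with confidence 0.9 of which five are matched,
# plus three low-confidence detections that form a bin with too few samples.
confidences = [0.9] * 10 + [0.2] * 3
matched = [1] * 5 + [0] * 5 + [0] * 3
ece = d_ece([(p,) for p in confidences], matched, bins_per_dim=(20,))
brier = brier_score(confidences, matched)
nll = negative_log_likelihood(confidences, matched)
average = class_weighted_average({"car": 0.04, "rider": 0.10}, {"car": 300, "rider": 100})
```

In the toy example, only the bin around a confidence of 0.9 is populated well enough to be
counted, and its gap between confidence (0.9) and precision (0.5) yields a D-ECE of 0.4, or
40% in the units of the result tables. Passing longer feature tuples together with
`bins_per_dim=(5, 5, 5, 5, 5)` gives the position-dependent variant.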

In our experiments, we first apply the standard calibration methods (Table 1) 
and compare the results with the baseline miscalibration of the detection model. 
We can already observe a default miscalibration of each examined network. This 
miscalibration is reduced by standard calibration, where the scaling methods perform
better than histogram binning. In a next step, we apply our
box-sensitive calibration that includes the confidence $\hat{p}$, position $c_x$, $c_y$ and shape $w$
and $h$ in a calibration mapping (Table 2). Similar to the confidence-only case in
Table 1, the scaling methods consistently outperform baseline and histogram binning
calibration.

In each calibration case, we observe a miscalibration of the base network that 
is alleviated using our proposed calibration methods. By examining the reliabil- 
ity diagram (Fig. 3), we observe that the networks are consistently miscalibrated 
for all confidence levels. This can be alleviated either by standard calibration or 
by position-dependent calibration. By examining the position-dependence of the 

Detection | Baseline | (a) Standard | (b) Full
Car       | 100%     | 99%          | 99%
Rider     | 99%      | 88%          | 95%
Rider     | 99%      | 87%          | 85%

Fig. 2 Qualitative example of position-dependent confidence calibration on a Cityscapes image
with detections obtained by a Mask R-CNN. First, standard calibration is performed using the
confidence estimates $\hat{p}$ only, without any position information. This results in rescaled but still
similar calibrated confidence estimates (note that we have distinct calibration models for each
class). Second, we can observe the effect of position-dependent calibration using the full
position information (position and shape), as the two riders with similar confidence are rescaled
differently. When using all available information, we can observe a significant difference in
calibration since both position and shape have an effect on the calibration


Fig. 3 Reliability diagram (object detection) for a Mask R-CNN, class pedestrian, inferred on
Cityscapes Frankfurt images, measured using the confidence only (x-axis: confidence, y-axis:
precision). Panels: (a) uncalibrated, (b) after standard calibration, (c) after position-dependent
calibration. The blue bar denotes the precision of samples that fall into this bin, with the gap
to perfect calibration (diagonal) highlighted in red. a The Mask R-CNN model is consistently
overconfident for each confidence level. b This can be alleviated by standard logistic
calibration c as well as by position-dependent logistic calibration




Table 1 Calibration results for object detection using the confidence $\hat{p}$ only. The best scores are
highlighted in bold. All calibration methods are able to reduce miscalibration. The best
performance is achieved using the scaling methods. Unlike histogram binning, the scaling methods
are monotonically increasing and thus do not affect the AUPRC score

Network      | Dataset    | IoU  | Cal. method | D-ECE (%) | Brier | NLL   | AUPRC
Faster R-CNN | COCO       | 0.50 | Baseline    | 17.839    | 0.206 | 0.622 | 0.833
             |            |      | HB          | 5.622     | 0.166 | 0.598 | 0.781
             |            |      | LC          | 4.449     | 0.158 | 0.478 | 0.833
             |            |      | BC          |           |       |       |
             |            | 0.75 | Baseline    | 29.721    | 0.272 | 0.861 | 0.767
             |            |      | HB          | 5.173     | 0.158 | 0.594 | 0.676
             |            |      | LC          | 4.387     | 0.147 | 0.461 | 0.766
             |            |      | BC          | 4.540     | 0.147 | 0.463 | 0.766
RetinaNet    | COCO       | 0.50 | Baseline    | 10.061    | 0.171 | 0.514 | 0.831
             |            |      | HB          | 5.332     | 0.166 | 0.579 | 0.795
             |            |      | LC          | 4.464     | 0.161 | 0.486 | 0.831
             |            |      | BC          | 4.443     | 0.161 | 0.487 | 0.831
             |            | 0.75 | Baseline    | 17.543    | 0.181 | 0.544 | 0.765
             |            |      | HB          | 5.773     | 0.147 | 0.551 | 0.721
             |            |      | LC          | 4.395     | 0.142 | 0.447 | 0.765
             |            |      | BC          | 4.361     | 0.142 | 0.447 | 0.765
Mask R-CNN   | Cityscapes | 0.50 | Baseline    | 10.822    | 0.146 | 0.500 | 0.951
             |            |      | HB          | 3.516     | 0.133 | 0.491 | 0.902
             |            |      | LC          | 3.305     | 0.125 | 0.380 | 0.951
             |            |      | BC          | 3.306     | 0.125 | 0.381 | 0.951
             |            | 0.75 | Baseline    | 29.530    | 0.271 | 1.063 | 0.893
             |            |      | HB          | 3.718     | 0.161 | 0.547 | 0.755
             |            |      | LC          | 4.399     | 0.136 | 0.425 | 0.893
             |            |      | BC          | 4.299     | 0.136 | 0.426 | 0.893


miscalibration (Figs. 4 and 5), a considerable increase of the calibration error toward 
the image boundaries can be observed. This calibration error is already well mitigated 
by standard calibration methods and can be further improved by position-dependent 
calibration. Also note that the AUPRC is not affected by standard scaling meth- 
ods (logistic/beta calibration) as these methods perform a monotonically increasing 
mapping of the confidence estimates and thus do not affect the order of the samples. 
However, this is not the case with histogram binning which may lead to a signifi- 
cant drop of the AUPRC. Furthermore, even the position-dependent scaling methods 
cannot guarantee a monotonically increasing mapping. However, compared to the 
improvement of the calibration, the impact on the AUPRC is marginal. Therefore, 

Table 2 Calibration results for object detection using all available information $\hat{p}$, $c_x$, $c_y$, $w$ and $h$.
Similar to the confidence-only case in Table 1, the scaling methods consistently outperform baseline
and histogram binning calibration. However, in the position-dependent case, the calibration mapping
is not monotonically increasing. This affects the AUPRC scores, but the effect is marginal compared
to the improvement in calibration

Network      | Dataset    | IoU  | Cal. method | D-ECE (%) | Brier | NLL   | AUPRC
Faster R-CNN | COCO       | 0.50 | Baseline    |           |       |       |
             |            |      | HB          | 3.891     | 0.206 | 1.076 | 0.701
             |            |      | LC          | 3.100     | 0.175 | 0.627 | 0.807
             |            |      | BC          | 3.240     | 0.168 | 0.508 | 0.807
             |            | 0.75 | Baseline    | 13.277    | 0.272 | 0.861 | 0.767
             |            |      | HB          | 3.848     | 0.200 | 1.037 | 0.587
             |            |      | LC          | 2.965     | 0.162 | 0.587 | 0.730
             |            |      | BC          |           |       |       |
RetinaNet    | COCO       | 0.50 | Baseline    | 4.303     | 0.171 | 0.514 | 0.831
             |            |      | HB          | 3.369     | 0.199 | 1.100 | 0.727
             |            |      | LC          |           |       |       |
             |            |      | BC          | 3.491     | 0.175 | 0.521 | 0.800
             |            | 0.75 | Baseline    | 6.724     | 0.181 | 0.544 | 0.765
             |            |      | HB          |           |       |       |
             |            |      | LC          | 2.872     | 0.157 | 0.595 | 0.732
             |            |      | BC          | 3.205     | 0.153 | 0.479 | 0.725
Mask R-CNN   | Cityscapes | 0.50 | Baseline    | 10.168    | 0.146 | 0.500 | 0.951
             |            |      | HB          | 5.241     | 0.151 | 0.535 | 0.854
             |            |      | LC          | 4.551     | 0.134 | 0.440 | 0.946
             |            |      | BC          | 6.117     | 0.134 | 0.413 | 0.925
             |            | 0.75 | Baseline    | 27.066    | 0.271 | 1.063 | 0.893
             |            |      | HB          | 8.062     | 0.195 | 0.607 | 0.681
             |            |      | LC          |           |       |       |


we conclude that our calibration methods are a valuable contribution especially for 
safety-critical systems, since they lead to statistically better calibrated confidences. 
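
The statement about the AUPRC can be illustrated with a small example. The sketch below
assumes a step-wise definition of the area under the precision/recall curve in which tied
scores are processed as one group; the chapter does not specify which AUPRC variant is
used. The scores, labels, and mapping parameters are invented for illustration.

```python
import math


def average_precision(scores, matched):
    """Step-wise area under the precision/recall curve; ties are handled as one group."""
    n_pos = sum(matched)
    groups = {}
    for s, y in zip(scores, matched):
        tp, n = groups.get(s, (0, 0))
        groups[s] = (tp + y, n + 1)
    ap, tp, seen, recall_prev = 0.0, 0, 0, 0.0
    for s in sorted(groups, reverse=True):  # sweep the threshold from high to low
        tp += groups[s][0]
        seen += groups[s][1]
        recall = tp / n_pos
        ap += (recall - recall_prev) * (tp / seen)
        recall_prev = recall
    return ap


def histogram_binning(scores, matched, n_bins):
    """Replace every score by the fraction of matched samples in its equal-width bin."""
    hits, counts = [0] * n_bins, [0] * n_bins
    idx = [min(int(s * n_bins), n_bins - 1) for s in scores]
    for i, y in zip(idx, matched):
        hits[i] += y
        counts[i] += 1
    return [hits[i] / counts[i] for i in idx]


scores = [0.95, 0.90, 0.85, 0.60, 0.55, 0.30]
matched = [1, 0, 1, 1, 0, 0]

ap_baseline = average_precision(scores, matched)
# A strictly increasing logistic mapping keeps the order of the samples.
rescaled = [1.0 / (1.0 + math.exp(-6.0 * (p - 0.7))) for p in scores]
ap_logistic = average_precision(rescaled, matched)
# Histogram binning maps all samples of a bin to the same value and creates ties.
binned = histogram_binning(scores, matched, n_bins=2)
ap_binned = average_precision(binned, matched)
```

The logistic mapping leaves the ranking and therefore the AUPRC unchanged, whereas the
binned scores can no longer separate the five detections of the upper bin, which lowers the
AUPRC in this example.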


5.2 Instance Segmentation 


After object detection, we investigate the calibration properties of instance segmen- 
tation models. According to the definition of calibration for segmentation models 
in (9), we can use each pixel within a segmentation mask as a separate prediction. 

Fig. 4 Reliability diagram (object detection) for a Mask R-CNN, class pedestrian, inferred on
Cityscapes Frankfurt images, measured by the confidence and the relative $c_y$ position of each
bounding box (axis labels: precision/confidence, expected calibration error (ECE), and
$c_x$/$c_y$ position). Panels: (a) uncalibrated, (b) after standard calibration, (c) after
position-dependent calibration. a We observe an increasing calibration error toward the
boundaries of the images. b This can be alleviated by standard logistic calibration c as well as
by position-dependent logistic calibration


Fig. 5 Reliability diagram (object detection) for a Mask R-CNN, class pedestrian, measured using
all available information. Panels: (a) uncalibrated, (b) after standard calibration, (c) after
position-dependent calibration. This diagram shows the D-ECE [%] for all image regions. a We can
observe a default miscalibration that increases toward the image boundaries. b Standard logistic
calibration already achieves good calibration results, c that can be further improved by
position-dependent calibration


This alleviates the problem of limited data availability and allows for a more robust
calibration evaluation. Recently, Kumar et al. [KLM19] started a discussion about
the sample-efficiency of binning and scaling methods. The authors show that binning
methods yield a more robust calibration mapping for large datasets but also tend to
overfit on small datasets. We can confirm this observation as histogram binning
provides poor calibration performance in our examinations on object detection
calibration, particularly for classes with fewer samples. In contrast, scaling methods
are more sample-efficient but also more inaccurate [KLM19]. Furthermore, scaling
methods are computationally more expensive as they require an iterative update of
the calibration parameters over the complete dataset, whereas a binning of samples
comes at low computational costs, especially for large datasets. Therefore, our
examinations are focused on the calibration performance of a (multivariate) histogram
binning using 15 bins for each dimension. For inference, we use pretrained Mask
R-CNN [HGDG17, WKM+19] as well as pretrained PointRend [KWHG20] models
to obtain predictions of instance segmentation masks for both datasets. As in the
object detection evaluation, all objects with a bounding box score below 0.3 are
neglected. We perform standard calibration using the confidence only, as well as a
position-dependent calibration including the x and y position of each pixel. The pixel
position is scaled by the mask's bounding box size to get position information in the
[0, 1] interval. Furthermore, we also include the distance of each pixel to the nearest
mask segment boundary as a feature in a calibration mapping since we expect a
higher uncertainty especially at the segment boundaries. The distance is normalized
by the bounding box's diagonal to obtain distance values in [0, 1].
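
The per-pixel features described above can be sketched as follows. This is an illustrative
brute-force version for small masks: boundary pixels are assumed to be mask pixels with at
least one 4-neighbour outside the mask, the bounding box is given by inclusive pixel
coordinates and assumed to be non-degenerate, and the function name `pixel_features` is
hypothetical.

```python
import math


def pixel_features(mask, box):
    """Per-pixel calibration features for one instance mask.

    mask: 2-D list of 0/1 values (rows of the image), 1 marks a mask pixel
    box:  (x_min, y_min, x_max, y_max) of the predicted bounding box, in pixels
    Returns {(x, y): (rel_x, rel_y, rel_dist)} for every mask pixel.
    """
    height, width = len(mask), len(mask[0])
    x_min, y_min, x_max, y_max = box
    box_w, box_h = x_max - x_min, y_max - y_min
    diagonal = math.hypot(box_w, box_h)

    def inside(x, y):
        return 0 <= x < width and 0 <= y < height and mask[y][x] == 1

    pixels = [(x, y) for y in range(height) for x in range(width) if mask[y][x] == 1]
    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    # Boundary pixels: mask pixels with at least one 4-neighbour outside the mask.
    boundary = [(x, y) for x, y in pixels
                if not all(inside(x + dx, y + dy) for dx, dy in neighbours)]

    features = {}
    for x, y in pixels:
        dist = min(math.hypot(x - bx, y - by) for bx, by in boundary)
        features[(x, y)] = ((x - x_min) / box_w, (y - y_min) / box_h, dist / diagonal)
    return features


# A 3x3 mask inside a 5x5 image with a tight bounding box.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
features = pixel_features(mask, box=(1, 1, 3, 3))
```

Only the central pixel of the toy mask has a non-zero boundary distance. For masks of
realistic size, a distance transform would be a more efficient way to obtain the same
distances.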

Since many data samples are available, we measure calibration using a D-ECE
with 15 bins, neglecting each bin with fewer than 8 samples. We further assess the
Brier score as well as the NLL loss as complementary metrics. The task of instance
segmentation is related to object detection. In a first step, the network also needs
to infer a bounding box for each detected object. Similar to the calibration
evaluation for object detection, we further use IoU thresholds of 0.50 and 0.75 to specify
whether a prediction has matched a ground-truth object or not. In contrast to object
detection, the IoU score within instance segmentation is computed by the overlap
of the inferred segmentation mask and the corresponding ground-truth object. Using
this definition, we can compute the AUPRC to evaluate the average quality of the
object detection branch as well as the corresponding segmentation masks. All results for
instance segmentation calibration for IoU of 0.50 and 0.75 are shown in Tables 3
and 4, respectively. These tables compare standard calibration (subset: confidence
only) with position-dependent calibration (subset: full). We observe a significant
miscalibration of the networks by default. Using standard calibration or our
calibration methods, it is possible to improve the calibration scores D-ECE, Brier, and
NLL for each case. The miscalibration is successfully corrected by histogram
binning using either the standard binning or the position-dependent binning. This is
also underlined by inspecting the reliability diagrams for the confidence-only case
(Fig. 6). Furthermore, we can also observe miscalibration that is dependent on the
relative x and y pixel position within a mask (Figs. 7 and 8). We can observe that
standard histogram binning calibration already achieves good calibration results but
still shows a weak dependency on the pixel position. Although the position-dependent
calibration only achieves a minor improvement in calibration compared
to the standard calibration case, it shows an equal calibration performance across the
whole image.
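
The mask-based matching criterion used in this evaluation can be sketched as follows. The
example is simplified: masks are represented as sets of pixel coordinates, class labels and
one-to-one assignment between predictions and ground-truth objects are ignored, and treating
an IoU equal to the threshold as a match is an assumption.

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as sets of (x, y) pixel coordinates."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


def is_matched(pred, gt_masks, threshold):
    """1 if the predicted mask overlaps any ground-truth mask with IoU >= threshold, else 0."""
    return int(any(mask_iou(pred, gt) >= threshold for gt in gt_masks))


# Two 4x4 masks that are shifted by one pixel in x direction.
pred = {(x, y) for x in range(0, 4) for y in range(0, 4)}
gt = {(x, y) for x in range(1, 5) for y in range(0, 4)}

iou = mask_iou(pred, gt)
matched_050 = is_matched(pred, [gt], threshold=0.50)
matched_075 = is_matched(pred, [gt], threshold=0.75)
```

The two toy masks share 12 of 20 pixels, so the prediction counts as matched at an IoU of 0.50
but not at 0.75. The resulting 0/1 labels are the targets against which the confidences are
calibrated and evaluated.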

In contrast to object detection, we observe a slightly increased miscalibration 
toward the center of a mask. As can be seen in Fig. 9, most pixels belonging to an
object mask are also located in the center. This underlines the need for a position- 
dependent calibration. Interestingly, although position-dependent calibration does 
not seem to offer better Brier or NLL scores compared to the confidence-only case, 
it significantly improves the mask quality as we can observe higher AUPRC scores 
for position-dependent calibration. We further provide a qualitative example that 
illustrates the effect of confidence calibration in Fig. 9. In this example, we can see 
the difference between standard calibration and position-dependent calibration. In 
the former case, the mask scores are only rescaled by their confidence which might 
lead to a better calibration but sometimes also to unwanted losses of mask segments 

Table 3 Calibration results for instance segmentation @ IoU=0.50. The best D-ECE scores are
underlined for each subset separately, since it is not appropriate to compare D-ECE scores of
different subsets to each other [KKSH20]. Furthermore, the best Brier, NLL, and AUPRC scores are
highlighted in bold. These scores are only calculated using the confidence information and thus can
be compared to each other. The histogram binning calibration consistently improves miscalibration.
Furthermore, we find that although position dependence does not improve the Brier and NLL scores
as in the standard case, it leads to a significant improvement in the AUPRC score


           |         |             | Subset: confidence only               | Subset: full
Network    | Dataset | Cal. method | D-ECE [%] | Brier | NLL   | AUPRC | D-ECE [%] | Brier | NLL   | AUPRC
Mask R-CNN | CS      | Baseline    | 7.088     | 0.110 | 0.432 | 0.724 | 11.205    | 0.110 | 0.432 | 0.724
           |         | HB          | 5.661     | 0.099 | 0.320 | 0.723 | 10.119    | 0.117 | 0.530 | 0.787
           | COCO    | Baseline    | 21.983    | 0.222 | 0.940 | 0.663 | 23.409    | 0.222 | 0.940 | 0.663
           |         | HB          | 6.420     | 0.150 | 0.442 | 0.662 | 13.640    | 0.171 | 0.776 | 0.760
PointRend  | CS      | Baseline    | 12.893    | 0.160 | 0.785 | 0.709 | 20.858    | 0.160 | 0.785 | 0.709
           |         | HB          | 2.706     | 0.105 | 0.326 | 0.698 | 19.021    | 0.190 | 1.299 | 0.758
           | COCO    | Baseline    | 22.284    | 0.222 | 0.946 | 0.672 | 23.973    | 0.222 | 0.946 | 0.672
           |         | HB          | 6.348     | 0.144 | 0.428 | 0.664 | 16.060    | 0.180 | 1.005 | 0.751


Table 4 Calibration results for instance segmentation @ IoU=0.75. The best scores are highlighted.
We observe the same behavior in calibration for all models as for the IoU=0.50 case shown in Table 3

           |         |             | Subset: confidence only               | Subset: full
Network    | Dataset | Cal. method | D-ECE [%] | Brier | NLL   | AUPRC | D-ECE [%] | Brier | NLL   | AUPRC
Mask R-CNN | CS      | Baseline    | 12.862    | 0.147 | 0.622 | 0.375 | 14.470    | 0.147 | 0.622 | 0.375
           |         | HB          | 5.895     | 0.108 | 0.340 | 0.375 | 10.273    | 0.120 | 0.523 | 0.479
           | COCO    | Baseline    | 26.559    | 0.250 | 1.070 | 0.237 | 27.219    | 0.250 | 1.070 | 0.237
           |         | HB          | 6.006     | 0.144 | 0.423 | 0.235 | 12.888    | 0.165 | 0.720 | 0.425
PointRend  | CS      | Baseline    | 18.668    | 0.192 | 0.929 | 0.347 | 25.386    | 0.192 | 0.929 | 0.347
           |         | HB          | 3.899     | 0.115 | 0.349 | 0.344 | 18.415    | 0.191 | 1.247 | 0.486
           | COCO    | Baseline    | 26.643    | 0.248 | 1.060 | 0.258 | 27.377    | 0.248 | 1.060 | 0.258
           |         | HB          | 6.708     | 0.138 | 0.411 | 0.238 | 15.344    | 0.173 | 0.936 | 0.388


(especially small objects in the background). In contrast, position-dependent
calibration is capable of capturing possible correlations between confidence and pixel position
or the size of an object. This leads to improved estimates of the mask confidences
even for smaller objects. Therefore, we conclude that multidimensional confidence 
calibration has a positive influence on the calibration properties as well as on the 
quality of the object masks. 

Fig. 6 Reliability diagram (instance segmentation) for a Mask R-CNN, class pedestrian, inferred
on MS COCO validation images, measured using the confidence only. Panels: (a) uncalibrated,
(b) after standard calibration, (c) after position-dependent calibration. The blue bar denotes the
frequency of samples that fall into this bin, with the gap to perfect calibration (diagonal) highlighted
in red. a The Mask R-CNN model is consistently overconfident for each confidence level. b This
can be alleviated by standard histogram binning calibration c as well as by position-dependent
logistic calibration

Fig. 7 Reliability diagram (instance segmentation) for a Mask R-CNN, class pedestrian, inferred
on MS COCO validation images, measured by the relative x position of each pixel within its
bounding box. a Similar to the examinations for object detection, we observe an increasing calibration
error toward the boundaries of the images. b Standard histogram binning significantly reduces the
calibration error, c whereas position-dependent histogram binning is capable of a slightly further
refinement of the position-dependent D-ECE

Fig. 8 Reliability diagram (instance segmentation) for a Mask R-CNN, class pedestrian, inferred
on MS COCO validation images. The diagram shows the D-ECE [%] that is measured using the
relative x and y position of each pixel within its bounding box. Similar to the observations in
Fig. 7, standard calibration already reduces the calibration error. However, we can observe a slightly
increased calibration error toward the image's center. In contrast, position-dependent calibration
shows equal calibration performance across the whole image

Fig. 9 Qualitative example of a Cityscapes image with instance segmentation masks obtained by
a Mask R-CNN. a Uncalibrated confidence of the predicted masks (mIoU = 0.289). b Instance
masks are calibrated using standard histogram binning (mIoU = 0.343). c Mask confidences using a
position-dependent confidence calibration (mIoU = 0.370). Unlike a simple scaling of confidences
in b, the calibration in c offers a more fine-grained mapping that achieves a better fit to the true
instance masks


5.3 Semantic Segmentation 


For the evaluation of semantic segmentation, we use the same datasets, the same bin- 
ning scheme, and the same features that have already been used for the instance seg- 
mentation in Sect. 5.2. For COCO mask inference, we use a pretrained Deeplabv2 
[CPK+18] as well as a pretrained HRNet model [SXLW19, WSC+20, YCW20]. 
For Cityscapes, we also use an HRNet as well as a pretrained Deeplabv3+ model
[CZP+18]. Similar to the instance segmentation, we also use the D-ECE, Brier score,
NLL loss, and mIoU to compare the baseline calibration with a histogram binning
model. As opposed to our previous experiments, we only use 15% of the provided 
samples to reduce computational complexity. The results for default miscalibration, 
standard calibration (subset: confidence only) and position-dependent calibration 
(subset: full) are shown in Table 5. Unlike instance segmentation calibration, the 

Table 5 Calibration results for semantic segmentation. The best D-ECE scores are underlined for
each subset separately, since it is not convenient to compare D-ECE scores with different subsets
to each other [KKSH20]. Furthermore, the best Brier, NLL, and mIoU scores are highlighted in
bold. These scores are only calculated using the confidence information and thus can be compared
to each other. In contrast to object detection and instance segmentation evaluation, we observe a
low baseline calibration error that is only slightly affected by confidence calibration

           |         |             | Subset: confidence only              | Subset: full
Network    | Dataset | Cal. method | D-ECE [%] | Brier | NLL   | mIoU  | D-ECE [%] | Brier | NLL   | mIoU
Deeplabv3+ | CS      | Baseline    | 0.162     | 0.060 | 0.139 | 0.623 | 0.186     | 0.060 | 0.139 | 0.623
           |         | HB          | 0.081     | 0.060 | 0.170 | 0.619 | 0.199     | 0.062 | 0.189 | 0.589
Deeplabv2  | COCO    | Baseline    | 0.094     | 0.458 | 1.173 | 0.933 | 0.149     | 0.458 | 1.173 | 0.933
           |         | HB          | 0.060     | 0.456 | 1.515 | 0.933 | 0.150     | 0.485 | 1.790 | 0.913
HRNet      | CS      | Baseline    | 0.067     | 0.057 | 0.115 | 0.629 | 0.149     | 0.057 | 0.115 | 0.629
           |         | HB          | 0.083     | 0.057 | 0.148 | 0.628 | 0.191     | 0.060 | 0.171 | 0.582
           | COCO    | Baseline    | 0.455     | 0.779 | 5.812 | 0.939 | 0.455     | 0.779 | 5.812 | 0.939
           |         | HB          | 0.062     | 0.563 | 2.261 | 0.939 | 0.138     | 0.571 | 2.372 | 0.931


baseline model is already intrinsically well-calibrated, offering a very low calibration
error. A qualitative example is shown in Fig. 10 to illustrate the effect of calibration
for semantic segmentation. In this example, we can observe only minor differences
between the uncalibrated mask and the masks either after standard calibration or
after position-dependent calibration. This aspect is supported by the confidence
reliability diagram shown in Fig. 11. Although we can observe an overconfidence, most
samples are located in the low-confidence range with a low calibration error, which also
results in an overall low miscalibration score. Furthermore, our calibration methods
are able to achieve even better calibration results, but mostly in the confidence-only
calibration case. In addition, we also observe only a low correlation between position
and calibration error (Figs. 12 and 13). One reason for the major difference between
instance and semantic segmentation calibration may be the difference in model
training. An instance segmentation model needs to infer an appropriate bounding box first
to achieve qualitatively good results in mask inference. In contrast, a semantic
segmentation model does not need to infer the position of objects within an image but is
able to directly learn and improve mask quality. We also suspect an influence of the
number of available data points, since a semantic segmentation model is able to use
each pixel within an image as a separate sample, whereas an instance segmentation
model is restricted to the pixels that are available within an estimated bounding
box. Therefore, we conclude that semantic segmentation models do not require a
post-hoc confidence calibration as they already offer a good calibration performance.

Fig. 10 Qualitative example of a Cityscapes image with semantic segmentation masks obtained by
a Deeplabv3+ model (color scale: confidence from 0.0 to 1.0). a Uncalibrated confidence of the
predicted masks (mIoU = 0.576), b here, the pixel masks are calibrated using standard histogram
binning (mIoU = 0.522). Furthermore, c mask confidences using a position-dependent confidence
calibration (mIoU = 0.570). Although we observe some minor differences, we conclude that
confidence calibration does not have a major influence on the calibration properties for semantic
segmentation models


Fig. 11 Reliability diagram (semantic segmentation) for a Deeplabv3+, class pedestrian, inferred
on Cityscapes Frankfurt images, measured using the confidence only (y-axis: accuracy). Panels:
(a) uncalibrated, (b) after standard calibration, (c) after position-dependent calibration. The blue
bar denotes the accuracy of samples that fall into this bin, with the gap to perfect calibration
(diagonal) highlighted in red. By default (a), the Deeplabv3+ model already offers a good
calibration performance that is only slightly affected either by standard calibration (b) or by
position-dependent calibration (c)

Fig. 12 Reliability diagram (semantic segmentation) for a Deeplabv3+, class pedestrian, inferred
on Cityscapes Frankfurt images, measured by the relative x position of each pixel. Panels:
(a) uncalibrated, (b) after standard calibration, (c) after position-dependent calibration. Since the
model has a low miscalibration by default (a), the effect of calibration in b or c is marginal

Fig. 13 Reliability diagram (semantic segmentation) for a Deeplabv3+, class pedestrian, inferred
on Cityscapes Frankfurt images. This diagram shows the D-ECE [%] that is measured using the
relative x and y position of each pixel. Since the model has a low miscalibration by default (a), the
effect of calibration in b or c is marginal


Conclusions 


Within the scope of confidence calibration, recent work has mainly focused on the
task of classification calibration. In this chapter, we presented an analysis of
confidence calibration for object detection as well as for instance and semantic
segmentation. First, we introduced definitions for confidence calibration
within the scope of object detection, instance segmentation, and semantic
segmentation. Second, we presented methods to measure and alleviate miscalibration of
detection and segmentation networks. These methods are extensions of well-known
calibration methods such as histogram binning [ZE01], Platt scaling [Pla99], and
beta calibration [KSFF17]. We extended these methods so that they encompass
additional calibration information such as position and shape. Finally, the experiments
revealed that common object detection models as well as instance segmentation
networks tend to be miscalibrated. In addition, we showed that auxiliary information such
as the estimated position or shape of a predicted object also has an influence on
confidence calibration. However, we also found that semantic segmentation models are
already intrinsically calibrated. Thus, the examined models do not require additional
post-hoc calibration and already offer well-calibrated mask confidence scores. We
argue that this difference between instance and semantic segmentation is a result
of data quality and availability during training. This leads to the assumption that
limited data availability is a direct cause for miscalibration during training and thus
an effect of overfitting. Nevertheless, our proposed calibration framework is capable
of calibrating object detection and instance segmentation models. In safety-critical
applications, the confidence in the applied algorithms is paramount. The proposed
calibration algorithms allow detecting situations of low confidence and thus performing
the appropriate system reaction. Therefore, calibrated confidence values can be used
as additional information especially in safety-critical applications.
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Uncertainty Quantification for Object R) 
Detection: Output- and Gradient-Based gert 
Approaches 


Tobias Riedlinger, Marius Schubert, Karsten Kahl, and Matthias Rottmann 


Abstract Safety-critical applications of deep neural networks require reliable confi- 
dence estimation methods with high predictive power. However, evaluating and com- 
paring different methods for uncertainty quantification is oftentimes highly context- 
dependent. In this chapter, we introduce flexible evaluation protocols which are appli- 
cable to a wide range of tasks with an emphasis on object detection. In this light, 
we investigate uncertainty metrics based on the network output, as well as metrics 
based on a learning gradient, both of which significantly outperform the confidence 
score of the network. While output-based uncertainty is produced by post-processing 
steps and is computationally efficient, gradient-based uncertainty, in principle, allows 
for localization of uncertainty within the network architecture. We show that both 
sources of uncertainty are mutually non-redundant and can be combined beneficially. 
Furthermore, we show direct applications of uncertainty quantification by improving 
detection accuracy. 


1 Introduction 


Deep artificial neural networks (DNNs) employed for tasks such as object detection 
or semantic segmentation yield by construction probabilistic predictions on feature 
data, such as camera images. Modern deep object detectors [RHGS15, LAE+16, 
LGG+17, RF18] predict bounding boxes for depicted objects of a given set of learned 
classes. The so-called “score” (sometimes also called objectness or confidence score)
indicates the probability that a bounding box provided by the DNN contains an
object. Throughout this chapter, we sometimes use the term “confidence” instead
of “score” to refer to quantities that represent the probability of a detection being
correct. The term confidence in a broader sense is meant to reflect an estimated degree
of competency of the model when evaluating an input.

Fig. 1 Confidence calibration plots for score (left), Monte-Carlo (MC) dropout uncertainty (center),
and our proposed gradient uncertainty metrics G (right) on po. For the sake of reference, we
provide the optimally calibrated diagonal in gray. See Sect. 4 for more details

Reliable and accurate probabilistic predictions are desirable for many applications,
such as medical diagnosis or automated driving. It is well-known that predictions
of DNNs often tend to be statistically miscalibrated [SZS+14, GSS15,
GPSW17], i.e., the score computed by the DNN is not representative of the relative
frequency of correct predictions. For an illustration of this, see Fig. 1, where
we compare different sources of uncertainty in terms of their accuracy conditioned
on confidence bins for a YOLOv3 model.

For individual DNN predictions, confidences that are not well-calibrated are mis- 
leading and constitute a reliability issue. Over-confident predictions might cause 
inoperability of a perception system due to producing non-existent instances (false 
positives / FP). Perhaps even more adverse, under-confidence might lead to false 
negative (FN) predictions, for instance overlooking pedestrians or other road users 
which could lead to dangerous situations or even fatalities. Given that the set of avail- 
able data features is fixed, the sources of uncertainty in machine learning (and deep 
learning) can be divided into two types [HW21], referring to their primary source 
[Gal17, KG17]. Whereas aleatoric uncertainty refers to the inherent and irreducible 
stochasticity of the distribution the data stems from, epistemic uncertainty refers to 
the reducible part of the uncertainty. The latter originates from the finite size of a ran- 
dom data sample used for training, as well as from the choice of model and learning 
algorithm. 

Up to now, there exists a moderate number of established methods in the field of
deep learning that have been demonstrated to properly estimate the uncertainty of DNNs.
In [KG17], DNNs for semantic segmentation are equipped with additional regression
outputs that model heteroscedastic (in machine learning also called predictive)
uncertainty, with the aim of capturing aleatoric uncertainty. For training, the
uncertainty output is integrated into the usual empirical cross-entropy loss. A number of
adaptations for different modern DNN architectures originate from that work. From
a theoretical point of view, Bayesian DNNs [DL90, Mac92], in which the model
weights are treated as random variables, yield an attractive framework for capturing
the epistemic uncertainty. In practice, the probability density functions corresponding
to the model weights are estimated via Markov chain Monte-Carlo, which currently
is infeasible already for a moderate number of model weights, hence being
inapplicable in the setting of object detection. As a remedy, variational inference approaches
have been considered, where the model weights are sampled from predefined
distributions, which are often modeled in simplified ways, e.g., assuming mutual
independence. A standard method for the estimation of epistemic uncertainty based on
variational inference is Monte-Carlo (MC) dropout [SHK+14, GG16]. Algorithmically,
this method performs several forward passes under dropout at inference time.
Deep ensemble sampling [LPB17] follows a similar idea. Here, separately trained
models with the same architecture are deployed to produce a probabilistic inference.

In semantic segmentation, Rottmann et al. [RCH+20] introduced the tasks of meta 
classification and regression that are both applied as post-processing to the output 
of DNNs. Meta classification refers to the task of classifying a prediction as true 
positive (TP) or FP. Meta regression is a more object detection-specific task with 
the aim to predict the intersection-over-union (IoU) of a prediction with its ground
truth directly. These tasks are typically learned with the help of ground truth, but 
afterwards performed in the absence of ground truth, only on the basis of uncer- 
tainty measures stemming from the output of the DNN. The framework developed in 
[RCH+20], called MetaSeg, was extended into multiple directions: Time-dynamic 
uncertainty quantification for semantic segmentation [MRG20] and instance seg- 
mentation [MRV+21]; a study on the influence of image resolution [RS19]; con- 
trolled reduction of FN and meta fusion, the latter utilizes the uncertainty metrics 
to increase the performance of the DNN [CRH+19]; out-of-distribution detection 
[BCR+20, ORF20, CRG21]; and active learning [CRGR21]. Last but not least, a 
similar approach solely based on the output of the DNN has been developed for object 
detection in [SKR20] (called MetaDetect), however, equipped with different uncer- 
tainty metrics specifically designed for the task of object detection. This approach 
was also used in an uncertainty-aware sensor fusion approach for object detection 
based on camera images and RaDAR point clouds; see [KRBG21]. 

In [ORG18], it was proposed to utilize a learning gradient, replacing the usually 
needed ground truth labels by the prediction of the network itself, which is sup- 
posed to contain epistemic uncertainty information. Since the investigation of gra- 
dient metrics for image classification tasks, that method was considered in different 
applications as well. Gradient metrics were compared to other uncertainty quan- 
tification methods in [SFKM20]. It was found that gradient uncertainty is strongly 
correlated with the softmax entropy of classification networks. In natural language 
understanding, gradient metrics turned out to be beneficial; see [VSG19]. Therein, 
gradient metrics and deep ensemble uncertainty were aggregated and yielded
well-calibrated confidences on out-of-distribution data. Recently, gradient metrics have
been developed for object detection in [RRSG21]. In order to demonstrate the
predictive power of uncertainty metrics, the framework of meta classification and regression
provided by MetaDetect was used. 

In this chapter, we review the works published in [SKR20, RRSG21] and put 
them into a common context. More precisely, our contributions are as follows: 


• We review the concepts of meta classification and regression, meta fusion, and
  confidence calibration. We explain how they serve as a general benchmark for
  evaluating the predictive power of any uncertainty metric developed for object
  detection.

• We review output-based uncertainty metrics developed in [SKR20], as well as
  gradient-based ones from inside the network [RRSG21].

• We compare baseline uncertainty measures such as the DNN’s score,
  well-established ones like Monte-Carlo dropout [Gal17], the output-based uncertainty
  metrics [SKR20], and finally the gradient-based uncertainty metrics [RRSG21]
  from inside the DNN. We do so in terms of comparing their standalone
  performances but also in terms of analyzing their mutual information and how much
  performance they add to the prediction of the network itself.


2 Related Work 


In this section, we gather and discuss previous research related to the present work. 


Epistemic and aleatoric uncertainty in object detection: In recent years, there 
has been increasing interest in methods to properly estimate or quantify uncer- 
tainty in DNNs. Aleatoric uncertainty, the type of uncertainty resulting from the 
data generating process, has been proposed to be captured by providing addi- 
tional regression output to the network architecture for learning uncertainty directly 
from the training dataset [KG17]. Since the initial proposition, this approach has 
been adapted to, among others, the object detection setting [LDBK18, CCKL19, 
HZW+19, LHK+21, HSW20, CCLK20]. The latter approaches add further regres- 
sion variables for aleatoric uncertainty which are assigned to their bounding box 
localization variables, thereby learning localization as Gaussian distributions. An 
alternative approach to quantifying localization uncertainty is to learn for each
predicted box the corresponding IoU with additional regression variables [JLM+18].

Several methods have been proposed that aim to capture epistemic uncertainty, 
i.e., “reducible” kinds of uncertainty. Monte-Carlo (MC) dropout [SHK+14, GG16] 
employs forward passes under dropout and has become one of the standard tools for 
estimating epistemic uncertainty. As such, MC dropout has been investigated also 
for deep object detection [MNDS18, KD19, HSW20]. Similarly to MC dropout, 
deep ensemble methods [LPB17] employ separately trained networks (“experts”) to 
obtain variational inference, an approach which also has found adaptations in deep 
object detection [LGRB20].


Uncertainty calibration: A large variety of methods have been proposed to rectify
the intrinsic confidence miscalibration of modern deep neural networks by
introducing post-processing methods (see, e.g., [GPSW17] for a comparison). Prominent
examples of such methods include histogram binning [ZE02], isotonic regression
[ZE01], Bayesian binning [NCH15], Platt scaling [Pla99, NMC05], or temperature
scaling [HVD14]. Alternatively, calibrated confidences have been obtained by
implementing calibration as an optimization objective [KP20] via a suitable loss term. In
the realm of object detection, the authors of [NZV18] found that temperature scal- 
ing improves calibration. Also, natural extensions to make calibration localization- 
dependent have been proposed [KKSH20]. 


Meta classification and meta regression: Meta classification denotes the task of
discriminating TP and FP predictions based on uncertainty or confidence information,
an idea that has been explored initially in [HG17]. In case a real-valued metric (e.g.,
different versions of the IoU which is in [0, 1]) can be assigned to the quality of a
prediction, meta regression denotes the task of predicting this quantity in the same spirit
as meta classification. In contrast to the IoUNet framework introduced in [JLM+18], 
meta classification and meta regression do not require architectural changes or ded- 
icated training of the respective DNN. In [RRSG21], it has been found that meta 
classifiers tend to yield well-calibrated confidences by default as opposed to the con- 
fidence score of DNNs. Applications of meta classification and meta regression have 
since been introduced for several different disciplines [CRH+19, RS19, MRG20, 
RMC+20, RCH+20, MRV+21, KRBG21], including semantic image segmentation, 
video instance segmentation, and object detection. In all of these applications, the 
central idea is to use a simple classification model to learn the map between uncer- 
tainty and the binary labels TP/FP. A related idea trying to exploit meta classification 
was introduced as “meta fusion” in [CRH+19]. The ability to discriminate TPs from
FPs allows for an increase in the neural network’s sensitivity. This produces
additional FPs, which are subsequently detected by a meta classifier, decreasing the
total number of errors made.


3 Methods 


3.1 Uncertainty Quantification Protocols 


Various methods for uncertainty or confidence estimation exist. While confidence 
scores are directly produced by a DNN, oftentimes other uncertainty-related quanti- 
ties, such as softmax margin or entropy, can be generated from the network output. 
In variational inference (e.g., MC dropout, deep ensembles, or batch norm uncer- 
tainty), for a single network input, several predictions are made. Their variance or 
standard deviation is then interpreted as a measure of prediction uncertainty.
Oftentimes, a common strategy for different kinds of uncertainty is missing and comparison
between methods is highly context-dependent. Moreover, different methods to
estimate uncertainty may produce mutually redundant information which, for lack of
comparability, cannot be established experimentally. We propose uncertainty
aggregation methods as a unifying strategy for different tasks.


Meta classification: In classification tasks, an input is categorized as one class out 
of a finite number of classes. Evaluation usually takes a binary value of either true or 
false. Meta classification can then be understood as the task of predicting the label 
true or false from uncertainty metrics. To this end, a lightweight classification model 
(e.g., logistic regression, a shallow neural network, or a gradient boosting classifier) 
is fitted to the training dataset, where uncertainty metrics can be computed and the 
actual, true label for each sample can be established. This can be regarded as learning 
a binary decision rule based on uncertainty information. On an evaluation dataset, 
such a meta classification model can be evaluated in terms of classification metrics 
like the area under the receiver operating characteristic AuROC or the area under 
precision-recall curve AuPR [DG06]. Different sources of uncertainty information, 
thus, serve as co-variables for meta classification models. Their performance can 
be regarded as a unified evaluation of confidence information. Network-intrinsic 
confidence scores naturally fit into this framework and can be either directly evaluated 
as any meta classification model or also serve as a single co-variable for such a model. 
In particular, different sources can be combined by fitting a meta classification model 
on the combined set of uncertainty metrics. This allows for the investigation of mutual 
redundancy between uncertainty information. Meta classification can be applied to 
any setting where a prediction can be marked as true or false and has served as a 
baseline for image classification [HG17], but also the tasks of semantic segmentation
(“MetaSeg”) [RCH+20] and object detection (“MetaDetect”) [SKR20] can be subject
to meta classification. In semantic segmentation, segments can be classified as true or
false based on their intersection-over-union (IoU) with the ground truth being above
a certain threshold. Similarly, bounding boxes are ordinarily declared as true or false
predictions based on their maximum IoU with any ground truth box (see Sect. 3.2).
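
As a minimal sketch of this evaluation protocol, the following snippet computes the AuROC of an arbitrary confidence (the DNN score or the output of a meta classifier) from labels marking each prediction as true or false. The function name `auroc` and the toy values are illustrative assumptions and are not taken from MetaSeg or MetaDetect.

```python
def auroc(confidences, labels):
    """AuROC as the probability that a randomly chosen true prediction
    receives a higher confidence than a randomly chosen false one
    (ties count one half)."""
    pos = [c for c, t in zip(confidences, labels) if t]
    neg = [c for c, t in zip(confidences, labels) if not t]
    if not pos or not neg:
        raise ValueError("need at least one true and one false prediction")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one confidence per prediction and its label
# (1 = true prediction, 0 = false prediction).
score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
is_true = [1, 1, 0, 1, 0, 0]
print(auroc(score, is_true))
```

Evaluating several uncertainty sources with the same function on the same evaluation split makes their predictive power directly comparable.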


Meta fusion: Obtaining well-performing meta classifiers allows for their incorpo- 
ration in the form of decision rules (replacing the standard score threshold) into the 
original prediction pipeline. Doing so is especially useful in settings where the DNN 
prediction is based on positive (foreground/detection) and negative (background) 
predictions, such as semantic segmentation and object detection. Implementing meta 
classifiers into such a framework is called meta fusion, whereby the improvement 
of the detection of false predictions over the DNN confidence is supplemented by 
increased prediction sensitivity of the DNN (see Fig. 2). Greater sensitivity tends to
reduce false negative (FN) predictions and increase true (TP) and false positive (FP)
predictions. The meta classifier’s task is then to identify the additional FPs as such
and filter them from the prediction, which leads to a net increase in prediction
performance of the combination of the DNN with the meta classifier over the standalone DNN
baseline. Such an increase can be evaluated utilizing metrics usually employed to
assess the performance of a DNN in the respective setting, e.g., mean IoU in
semantic segmentation [CRH+19] or mean average precision (mAP) in object detection
[RRSG21].
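
The decision rule behind meta fusion can be sketched as follows. The dictionary keys, threshold values, and detections are hypothetical and only illustrate the mechanism: the score threshold is lowered to raise sensitivity, and the meta classifier confidence then decides which boxes are kept.

```python
def standard_filter(boxes, score_thr):
    """Baseline decision rule: keep boxes whose DNN score reaches the threshold."""
    return [b for b in boxes if b["score"] >= score_thr]


def meta_fusion_filter(boxes, low_score_thr, meta_thr):
    """Meta fusion: run the detector at a lower score threshold (higher
    sensitivity) and keep only boxes the meta classifier rates as true."""
    sensitive = [b for b in boxes if b["score"] >= low_score_thr]
    return [b for b in sensitive if b["meta_conf"] >= meta_thr]


# Hypothetical detections: DNN score, meta classifier confidence, true label.
boxes = [
    {"score": 0.90, "meta_conf": 0.95, "tp": True},
    {"score": 0.55, "meta_conf": 0.20, "tp": False},
    {"score": 0.30, "meta_conf": 0.85, "tp": True},   # missed by the baseline
    {"score": 0.25, "meta_conf": 0.10, "tp": False},
]
baseline = standard_filter(boxes, score_thr=0.5)
fused = meta_fusion_filter(boxes, low_score_thr=0.2, meta_thr=0.5)
print([b["tp"] for b in baseline], [b["tp"] for b in fused])
```

In this toy case, the fused rule recovers a true positive that the score threshold alone discards and removes the false positive that the baseline keeps.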


Calibration: Confidence estimates with high predictive power (as measured in meta
classification) have useful applications, e.g., in meta fusion where detection accuracy
can be improved by investing uncertainty information. In addition to this advantage,
there is another aspect to the probabilistic estimation of whether the corresponding
prediction is correct or not. Confidence calibration tries to gauge the accuracy of the
frequentist interpretation: for example, from 100 examples each with a confidence
of about 1/4, statistically we expect about 25 of those examples to be true predictions.

This is often investigated by dividing the confidence range into bins of equal
width and computing the accuracy of predictions conditioned on the respective bin
(see Fig. 3). For calibrated confidences, the resulting distribution should then be
close to the ideally calibrated diagonal. Meta classification is closely related to post- 
processing calibration methods in that a map is learned on a validation dataset which 
yields confidences with improved calibration. In particular, meta classification with 
the confidence score as the only co-variable is (up to the explicit optimization objec- 
tive) isotonic regression as introduced in [ZE02]. 
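
The binning procedure described above can be sketched in a few lines. The function names are illustrative. The expected calibration error (ECE) used as a scalar summary is a common choice that we add here as an assumption, since the text itself only describes the per-bin accuracy.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (mean confidence, accuracy, count) for every non-empty bin
    of equal width on [0, 1]."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for c, t in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 falls into the last bin
        sums[idx][0] += c
        sums[idx][1] += float(t)
        sums[idx][2] += 1
    return [(s / n, a / n, n) for s, a, n in sums if n > 0]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted mean gap between accuracy and mean confidence per bin."""
    bins = reliability_bins(confidences, correct, n_bins)
    total = sum(n for _, _, n in bins)
    return sum(n / total * abs(acc - conf) for conf, acc, n in bins)


# The example from the text: 100 predictions with confidence 1/4,
# of which 25 are true predictions, are perfectly calibrated.
confs = [0.25] * 100
hits = [1] * 25 + [0] * 75
print(expected_calibration_error(confs, hits))
```

Plotting the accuracy of each bin against its mean confidence yields the reliability diagram, with the diagonal as the calibrated reference.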


Meta regression: In a regression task, a continuous variable (e.g., in $\mathbb{R}$) is estimated
given an input. By the same logic as in meta classification, whenever the quality
of the prediction of a DNN can be expressed as a real number, we may fit a simple
regression model (e.g., linear regression, a shallow neural network, or a gradient
boosting regression model) on a training dataset where both uncertainty metrics and
the actual prediction quality can be computed. This is called a meta regression model
which maps uncertainty information to an (e.g., $\mathbb{R}$-valued) estimation of the
prediction quality of the DNN (see Fig. 4). The relationship between actual and predicted
quality can, again, be evaluated on a dedicated dataset in terms of the coefficient
of determination $R^2$, a well-established quality metric for regression models. Meta
regression shares many features as an evaluation protocol with meta classification,
perhaps with the exception that network confidence scores fit less directly into the meta
regression logic. However, a meta regression model based on such a score still
follows the same design. We note that fitting non-linear regression models (e.g., a neural
network or gradient boosting as opposed to a linear regression) on one variable
tends to not significantly improve meta regression performance. The IoU or
modifications thereof in semantic segmentation [RCH+20] or object detection [SKR20]
are suitable quality measures to be used in meta regression.
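
A minimal sketch of meta regression on a single uncertainty metric is given below, assuming an ordinary least-squares line as the regression model. The helper names and the toy values are hypothetical; in practice the model is fitted on a training split and $R^2$ is computed on a separate evaluation split.

```python
def fit_linear(u, q):
    """Least-squares fit q ~ a * u + b for one uncertainty metric u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_q = sum(q) / n
    var_u = sum((ui - mean_u) ** 2 for ui in u)
    cov = sum((ui - mean_u) * (qi - mean_q) for ui, qi in zip(u, q))
    a = cov / var_u
    return a, mean_q - a * mean_u


def r_squared(q_true, q_pred):
    """Coefficient of determination R^2."""
    mean_q = sum(q_true) / len(q_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(q_true, q_pred))
    ss_tot = sum((t - mean_q) ** 2 for t in q_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: an uncertainty metric and the true IoU per prediction.
uncertainty = [0.0, 0.2, 0.4, 0.8]
true_iou = [0.9, 0.8, 0.7, 0.5]
a, b = fit_linear(uncertainty, true_iou)
predicted_iou = [a * u + b for u in uncertainty]
print(a, b, r_squared(true_iou, predicted_iou))
```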


3.2 Deep Object Detection Frameworks 


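We develop output- and gradient-based uncertainty metrics for the task of 2D bounding
box (“object”) detection on camera images $x \in I^{H \times W \times C}$, with $I = [0, 1]$. A DNN
tailored to this task usually produces a fixed number $N_{\mathrm{out}} \in \mathbb{N}$ of output boxes

\[
\mathcal{O}(x; \theta) = \left(\mathcal{O}^{1}(x; \theta), \ldots, \mathcal{O}^{N_{\mathrm{out}}}(x; \theta)\right) \in \left(\mathbb{R}^{4} \times S \times I^{N_C}\right)^{N_{\mathrm{out}}}, \tag{1}
\]

where $\theta \in \mathbb{R}^{N_{\mathrm{DNN}}}$, with $N_{\mathrm{DNN}} \in \mathbb{N}$ denoting the number of weights in the
respective DNN architecture, and $N_C \in \mathbb{N}$ is the fixed number of classes (the set of
classes is denoted by $\mathcal{N}_C = \{1, \ldots, N_C\}$) to be learned. Each output box
$\mathcal{O}^{j}(x; \theta) = (\hat{\xi}^{j}, \hat{s}^{j}, \hat{p}^{j}) \in \mathbb{R}^{4} \times S \times I^{N_C}$ for $j \in \mathcal{N}_{\mathrm{out}} = \{1, \ldots, N_{\mathrm{out}}\}$ is encoded by


• Four localization variables, e.g., $\hat{\xi}^{j} = (\hat{c}^{j}_{\min}, \hat{r}^{j}_{\min}, \hat{c}^{j}_{\max}, \hat{r}^{j}_{\max}) \in \mathbb{R}^{4}$ (top-left and
  bottom-right corner coordinates),

• Confidence score $\hat{s}^{j} \in S = (0, 1)$ indicating the probability of an object existing
  at $\hat{\xi}^{j}$, and

• Class probability distribution $\hat{p}^{j} = (\hat{p}^{j}_{1}, \ldots, \hat{p}^{j}_{N_C}) \in I^{N_C}$.


The predicted class of $\mathcal{O}^{j}(x; \theta)$ is determined to be $\hat{\kappa}^{j} = \arg\max_{k \in \mathcal{N}_C} \hat{p}^{j}_{k}$. The $N_{\mathrm{out}}$
boxes subsequently undergo filtering mechanisms resulting in $|\mathcal{N}_x| = N_x$ detected
instances $\mathcal{N}_x \subseteq \mathcal{N}_{\mathrm{out}}$

\[
\hat{y}(x; \theta) = \left(\mathcal{B}^{1}(x; \theta), \ldots, \mathcal{B}^{N_x}(x; \theta)\right) \in \left(\mathbb{R}^{4} \times S \times I^{N_C}\right)^{N_x}. \tag{2}
\]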


Commonly, the much smaller number $N_x \ll N_{\mathrm{out}}$ of predicted boxes is the result
of two mechanisms. First, score thresholding, i.e., discarding any $\mathcal{O}^{j}(x; \theta)$ with
$\hat{s}^{j} < \varepsilon_s$, for some fixed threshold $\varepsilon_s > 0$, is performed as an initial distinction between
“foreground” and “background” boxes. Afterward, the non-maximum suppression
(NMS) algorithm is used to reduce boxes with the same class and significant mutual
overlap to only one box. This way, for boxes that are likely indicating the same
object in $x$, there is only one representative in $\hat{y}$. By “overlap”, we usually mean
the intersection-over-union (IoU) between two bounding boxes $\hat{\xi}^{j}$ and $\hat{\xi}^{k}$ which is
defined as

\[
\mathrm{IoU}\left(\hat{\xi}^{j}, \hat{\xi}^{k}\right) = \frac{\left|\hat{\xi}^{j} \cap \hat{\xi}^{k}\right|}{\left|\hat{\xi}^{j} \cup \hat{\xi}^{k}\right|}, \tag{3}
\]


i.e., the ratio of their area of intersection and their joint area. The IoU between two
boxes is always in [0, 1], where 0 means no overlap and 1 means the boxes have
identical localization. One also rates the localization quality of a prediction in terms
of the maximum IoU it has with any ground truth box on $x$ which has the same class.

The NMS algorithm is based on what we call “candidate boxes”. An output
instance $\mathcal{O}^{k}(x; \theta)$ is a candidate box for another output instance $\mathcal{O}^{j}(x; \theta)$ if it


1. has a sufficiently high score ($\hat{s}^{k} \geq \varepsilon_s$ above a chosen, fixed threshold $\varepsilon_s > 0$),
2. has the same class as $\mathcal{O}^{j}(x; \theta)$, i.e., $\hat{\kappa}^{k} = \hat{\kappa}^{j}$,
3. has sufficient IoU with $\mathcal{O}^{j}(x; \theta)$, i.e., $\mathrm{IoU}(\hat{\xi}^{k}, \hat{\xi}^{j}) \geq \varepsilon_{\mathrm{IoU}}$ for some fixed threshold
   $\varepsilon_{\mathrm{IoU}} \geq 0$ (oftentimes $\varepsilon_{\mathrm{IoU}} = \frac{1}{2}$).


We then denote by

\[
\mathrm{cand}\left[\mathcal{O}^{j}(x; \theta)\right] = \left\{\mathcal{O}^{k}(x; \theta) : \mathcal{O}^{k}(x; \theta) \text{ is a candidate box for } \mathcal{O}^{j}(x; \theta)\right\} \tag{4}
\]

the set of candidate boxes for $\mathcal{O}^{j}(x; \theta)$. In the NMS algorithm, all output boxes
$\mathcal{O}(x; \theta)$ are sorted according to their score in descending order. Iteratively, the box
$\mathcal{O}^{j}(x; \theta)$ with the highest score is selected as a prediction and $\mathrm{cand}[\mathcal{O}^{j}(x; \theta)]$ is
removed from the ranking of output boxes. This procedure is repeated until there are
no boxes left. Note that NMS usually follows score thresholding so this stage may
be reached quickly, depending on $\varepsilon_s$.
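
The filtering pipeline described above (score thresholding, candidate boxes, and NMS) can be sketched as follows. This is a simplified illustration rather than the implementation of any particular detection framework. The dictionary keys `xi`, `s`, and `kappa` as well as the default thresholds are assumptions made for this example, with boxes given as (c_min, r_min, c_max, r_max).

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (c_min, r_min, c_max, r_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, eps_s=0.3, eps_iou=0.5):
    """Score thresholding followed by NMS. Returns a list of pairs
    (prediction, suppressed candidate boxes of that prediction)."""
    remaining = sorted((b for b in boxes if b["s"] >= eps_s),
                       key=lambda b: b["s"], reverse=True)
    results = []
    while remaining:
        best = remaining.pop(0)  # highest remaining score becomes a prediction
        cand = [b for b in remaining
                if b["kappa"] == best["kappa"]
                and iou(b["xi"], best["xi"]) >= eps_iou]
        remaining = [b for b in remaining if all(b is not c for c in cand)]
        results.append((best, cand))
    return results


# Hypothetical output boxes: localization, score, and predicted class.
outputs = [
    {"xi": (0, 0, 10, 10), "s": 0.9, "kappa": 1},
    {"xi": (1, 1, 11, 11), "s": 0.8, "kappa": 1},    # suppressed by the first box
    {"xi": (0, 0, 10, 10), "s": 0.7, "kappa": 2},    # other class, kept
    {"xi": (50, 50, 60, 60), "s": 0.1, "kappa": 1},  # below the score threshold
]
predictions = nms(outputs)
print([(p["s"], len(c)) for p, c in predictions])
```

Keeping the suppressed candidates next to each prediction, instead of discarding them, is what makes the output-based metrics of Sect. 3.3 computable afterwards.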

DNN object detection frameworks are usually trained in a supervised manner
on images with corresponding labels or ground truth. The latter usually consist of a
number $\bar{N}_x$ of ground truth instances $\bar{y} = (\bar{y}^{1}, \ldots, \bar{y}^{\bar{N}_x}) \in (\mathbb{R}^{4} \times \mathcal{N}_C)^{\bar{N}_x}$ consisting of
similar data as output boxes which we denote by the bar $(\bar{\cdot})$. In particular, each ground
truth instance $\bar{y}^{i}$ has a specified localization $\bar{\xi}^{i} = (\bar{c}^{i}_{\min}, \bar{r}^{i}_{\min}, \bar{c}^{i}_{\max}, \bar{r}^{i}_{\max}) \in \mathbb{R}^{4}$ and
category $\bar{\kappa}^{i} \in \mathcal{N}_C$. From the network output $\mathcal{O}(x; \theta)$ on an image $x$ and the ground
truth $\bar{y}$, a loss function $J(\mathcal{O}(x; \theta), \bar{y})$ is computed and iteratively minimized over
$\theta$ by stochastic gradient descent or some of its variants. In most object detection
frameworks, $J$ splits up additively into

\[
J = J_{\xi} + J_{s} + J_{p}, \tag{5}
\]

with $J_{\xi}$ punishing localization mistakes, $J_{s}$ punishing incorrect score assignments
to boxes (incorrect boxes with high score and correct boxes with low score), and
$J_{p}$ punishing an incorrect class probability distribution, respectively. In standard
gradient descent optimization, the weights $\theta$ are then updated by the following rule:

\[
\theta \leftarrow \theta - \eta \nabla_{\theta} J(\mathcal{O}(x; \theta), \bar{y}), \tag{6}
\]

where $\eta > 0$ is the either fixed or variable learning rate parameter. The learning
gradient $g(x, \theta, \bar{y}) := \nabla_{\theta} J(\mathcal{O}(x; \theta), \bar{y})$ will play a central role for the gradient-based
uncertainty metrics which will be introduced in Sect. 3.4.


3.3 Output-Based Uncertainty: MetaDetect 


In this section, we construct uncertainty metrics for every $\mathcal{B}^{j}(x; \theta) \in \hat{y}(x, \theta)$. We
do so in two stages, first by introducing the general metrics that can be obtained
from the object detection pipeline. Second, we extend this by additional metrics
that can be computed when using MC dropout. We consider a predicted bounding
box $\mathcal{B}^{j}(x; \theta) \in \hat{y}(x, \theta)$ and its corresponding filtered candidate boxes $\mathcal{O}^{k}(x; \theta) \in
\mathrm{cand}[\mathcal{B}^{j}(x; \theta)]$ that were discarded by the NMS.

The number of corresponding candidate boxes $\mathcal{O}^{k}(x; \theta) \in \mathrm{cand}[\mathcal{B}^{j}(x; \theta)]$ filtered
by the NMS intuitively gives rise to the likelihood of observing a true positive.
The more candidate boxes $\mathcal{O}^{k}(x; \theta)$ belong to $\mathcal{B}^{j}(x; \theta)$, the more likely it seems
that $\mathcal{B}^{j}(x; \theta)$ is a true positive. We denote by $N^{(j)}$ the number of candidate boxes
$\mathcal{O}^{k}(x; \theta)$ belonging to $\mathcal{B}^{j}(x; \theta)$, but suppressed by NMS. We increment this number
by 1 and also count in $\mathcal{B}^{j}(x; \theta)$.

For a given image $x$, we have the set of predicted bounding boxes $\hat{y}(x, \theta)$ and
the ground truth $\bar{y}$. As we want to calculate values that represent the quality of the
prediction of the neural network, we first have to define uncertainty metrics for the
predicted bounding boxes in $\hat{y}(x, \theta)$. For each $\mathcal{B}^{j}(x; \theta) \in \hat{y}(x, \theta)$, we define the
following quantities:


• the number of candidate boxes $N^{(j)} \geq 1$ that belong to $\mathcal{B}^{j}(x; \theta)$ (i.e., $\mathcal{B}^{j}(x; \theta)$
  belongs to itself; one metric),

• the predicted box $\mathcal{B}^{j}(x; \theta)$ itself, i.e., the values of the tuple

  \[
  \left(\hat{c}^{j}_{\min}, \hat{r}^{j}_{\min}, \hat{c}^{j}_{\max}, \hat{r}^{j}_{\max}, \hat{s}^{j}, \hat{p}^{j}_{1}, \ldots, \hat{p}^{j}_{N_C}\right) \in \mathbb{R}^{4} \times S \times I^{N_C}, \tag{7}
  \]

  as well as $\sum_{i \in \mathcal{N}_C} \hat{p}^{j}_{i} \in \mathbb{R}$ whenever class probabilities are not normalized ($6 + N_C$
  metrics),

• size $d = (\hat{r}^{j}_{\max} - \hat{r}^{j}_{\min}) \cdot (\hat{c}^{j}_{\max} - \hat{c}^{j}_{\min})$ and circumference
  $g = 2 \cdot (\hat{r}^{j}_{\max} - \hat{r}^{j}_{\min}) + 2 \cdot (\hat{c}^{j}_{\max} - \hat{c}^{j}_{\min})$ (two metrics),

• $\mathrm{IoU}_{pb}$: the IoU of $\mathcal{B}^{j}(x; \theta)$ and the box with the second highest score that was
  suppressed by $\mathcal{B}^{j}(x; \theta)$. This value is zero if there are no boxes corresponding to
  $\mathcal{B}^{j}(x; \theta)$ suppressed by the NMS (i.e., $N^{(j)} = 1$; one metric),

• the minimum, maximum, arithmetic mean, and standard deviation for
  $(\hat{r}^{j}_{\min}, \hat{r}^{j}_{\max}, \hat{c}^{j}_{\min}, \hat{c}^{j}_{\max}, \hat{s}^{j})$, size $d$ and circumference $g$ from $\mathcal{B}^{j}(x; \theta)$ and all the filtered
  candidate boxes that were discarded from $\mathcal{B}^{j}(x; \theta)$ in the NMS ($4 \times 7$ metrics),

• the minimum, maximum, arithmetic mean, and standard deviation for the IoU
  of $\mathcal{B}^{j}(x; \theta)$ and all the candidate boxes corresponding to $\mathcal{B}^{j}(x; \theta)$ that were
  suppressed in the NMS (four metrics),

• relative sizes $rd = d/g$, $rd_{\max} = d/g_{\min}$, $rd_{\min} = d/g_{\max}$, $rd_{\mathrm{mean}} = d/g_{\mathrm{mean}}$, and
  $rd_{\mathrm{std}} = d/g_{\mathrm{std}}$ (five metrics),

• the maximal IoU of $\mathcal{B}^{j}(x; \theta)$ and all ground truth boxes in $\bar{y}$; this is not an input
  to a meta model but serves as the ground truth provided to the respective loss
  function.


Altogether, this results in $47 + N_C$ uncertainty metrics which can be aggregated
further in a meta classifier or a meta regression model.
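
To make the construction concrete, the sketch below computes a subset of these metrics for one predicted box and its suppressed candidate boxes. Function and key names are illustrative and not taken from the MetaDetect code. Using the population standard deviation is an assumption, and $rd_{\mathrm{std}}$ is left out because $g_{\mathrm{std}}$ vanishes whenever $N^{(j)} = 1$.

```python
import statistics


def size(xi):
    c_min, r_min, c_max, r_max = xi
    return (r_max - r_min) * (c_max - c_min)


def circumference(xi):
    c_min, r_min, c_max, r_max = xi
    return 2 * (r_max - r_min) + 2 * (c_max - c_min)


def summary(values):
    """Minimum, maximum, arithmetic mean, and (population) standard deviation."""
    return {"min": min(values), "max": max(values),
            "mean": statistics.mean(values), "std": statistics.pstdev(values)}


def output_metrics(pred, suppressed):
    """pred: predicted box; suppressed: candidate boxes removed by the NMS.
    Each box is a dict with localization 'xi' and score 's'."""
    group = [pred] + suppressed  # the prediction counts as its own candidate
    d, g = size(pred["xi"]), circumference(pred["xi"])
    metrics = {"N": len(group), "d": d, "g": g, "rd": d / g}
    stats = {"s": summary([b["s"] for b in group]),
             "d": summary([size(b["xi"]) for b in group]),
             "g": summary([circumference(b["xi"]) for b in group])}
    for name, values in stats.items():
        for key, value in values.items():
            metrics[f"{name}_{key}"] = value
    # Relative sizes with respect to the circumference statistics.
    metrics["rd_max"] = d / stats["g"]["min"]
    metrics["rd_min"] = d / stats["g"]["max"]
    metrics["rd_mean"] = d / stats["g"]["mean"]
    return metrics


# Hypothetical prediction with one suppressed candidate box.
pred = {"xi": (0, 0, 10, 20), "s": 0.9}
suppressed = [{"xi": (0, 0, 10, 10), "s": 0.7}]
m = output_metrics(pred, suppressed)
print(m["N"], m["d"], m["g"], m["rd_max"], m["rd_mean"])
```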

We now elaborate on how to calculate uncertainty metrics for every $\mathcal{B}^{j}(x; \theta) \in
\hat{y}(x, \theta)$ when using MC dropout. To this end, we consider the bounding box
$\mathcal{B}^{j}(x; \theta) \in \hat{y}(x, \theta)$ that was predicted without dropout and then we observe under
dropout $K$ times the output of the same anchor box that produced $\mathcal{B}^{j}(x; \theta)$ and denote
them by $\mathcal{B}^{j}(x; \theta)_{1}, \ldots, \mathcal{B}^{j}(x; \theta)_{K}$. For these $K + 1$ boxes $\mathcal{B}^{j}(x; \theta), \mathcal{B}^{j}(x; \theta)_{1}, \ldots,
\mathcal{B}^{j}(x; \theta)_{K}$, we calculate the standard deviation for the localization variables, the
objectness score, and class probabilities. This is done for every $\mathcal{B}^{j}(x; \theta) \in \hat{y}(x, \theta)$
and results in $4 + 1 + N_C$ additional dropout uncertainty metrics which we denote
by $\mathrm{std}_{\mathrm{MC}}(\cdot)$ for each of the respective $4 + 1 + N_C$ box features. Executing this
procedure for all available test images, we end up with a structured dataset. Each row
represents exactly one predicted bounding box and the columns are given by the
registered metrics. After defining a training/test splitting of this dataset, we learn meta
classification ($\mathrm{IoU} \geq 0.5$ vs. $\mathrm{IoU} < 0.5$) and meta regression (quantitative IoU
prediction) from the training part of this data. All the presented metrics, except for the
true IoU, can be computed without the knowledge of the ground truth. We now want
to explore to which extent they are suited for the tasks of meta classification and
meta regression.
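
The dropout metrics reduce to a feature-wise standard deviation over $K + 1$ box feature vectors, which can be sketched as follows. The function name `std_mc` mirrors the notation above, the use of the population standard deviation is an assumption, and the feature values are hypothetical.

```python
import statistics


def std_mc(reference, dropout_samples):
    """Feature-wise standard deviation over the K + 1 feature vectors
    (four localization variables, score, N_C class probabilities)."""
    vectors = [reference] + dropout_samples
    return [statistics.pstdev(column) for column in zip(*vectors)]


# Hypothetical box without dropout and K = 2 dropout samples, N_C = 2.
reference = [0, 0, 10, 10, 0.9, 0.8, 0.2]
samples = [[0, 0, 10, 12, 0.7, 0.6, 0.4],
           [0, 0, 10, 14, 0.8, 0.7, 0.3]]
dropout_metrics = std_mc(reference, samples)
print(dropout_metrics)
```

Appending these values to the output-based metrics of a box yields one row of the structured dataset on which the meta models are fitted.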


3.4 Gradient-Based Uncertainty for Object Detection


In the setting provided in Sect. 3.2, a DNN learns from a data point $(x, \bar{y}) \in
I^{H \times W \times C} \times (\mathbb{R}^{4} \times \mathcal{N}_C)^{\bar{N}_x}$ by computing the gradient $g(x, \theta, \bar{y})$. The latter is
subsequently used to iteratively adjust the current weights $\theta$, for example, by the formula
given in (6). As such, the quantity $g(x, \theta, \bar{y})$ can be interpreted as learning stress on
$\theta$ induced by the training data $(x, \bar{y})$. We explain next how a related quantity gives
rise to confidence information.


Some intuition about gradient uncertainty: Generally, the loss function $J$
measures the “closeness” of output boxes to the ground truth $\bar{y}$ and $g(x, \theta, \bar{y})$ expresses
the induced change in $\theta$ for any deviation from the ground truth. If we assume that
$\hat{y}(x; \theta)$ is close to $\bar{y}$, only little change in $\theta$ will be required in an update step. We,
therefore, expect $g(x, \theta, \bar{y})$ to be of comparably small magnitude. A DNN which is
already “well-trained” in the ordinary sense is expected to express high confidence
when its prediction is correct for all practical purposes. Conversely, if the network
output deviates from the ground truth significantly, learning on $(x, \bar{y})$ induces a
larger adjustment in $\theta$, leading to a gradient of larger magnitude. In this situation,
the confidence in this prediction should be small.

The gradient $g(x, \theta, y)$ is not realizable as a measure of uncertainty or confidence,
since it uses the ground truth $y$, which is not accessible at inference time. An
approach to circumvent this shortcoming was presented in [ORG18], in which the
ground truth $y$ was replaced by the prediction $\hat{y}$ of the DNN on $x$. We format the
prediction so that it has the same format as the ground truth. Particularly, in the initial
application to image classification, the softmax output of the DNN was collapsed by
an arg max to the predicted class before insertion into $J$. With respect to bounding
box localization, there is no adjustment required. Additionally, when computing

$$g(x, \theta, \hat{y}) = \nabla_{\theta} J(\hat{y}(x; \theta), \hat{y}), \tag{8}$$


we disregard the dependency of $\hat{y}$ on $\theta$, thereby sticking to the design motivation
from above. The quantity in (8) represents the weight adjustment induced in the
DNN when learning that its prediction on $x$ is correct. Uncertainty or confidence
information is distilled from $g(x, \theta, \hat{y})$ by mapping it to a scalar. To this end, it is
natural to employ norms, e.g., $\|\cdot\|_1$ or $\|\cdot\|_2$, but we also use other maps such as
minimal and maximal entry as well as the arithmetic mean and standard deviation
over the gradient’s entries (see (9)). Gradient uncertainty metrics allow for some
additional flexibility. Regarding (8), we see that if $J$ splits into additive terms (such
as in the object detection setting, see (5)), gradient metrics allow the extraction of
gradient uncertainty from each individual contribution. Additionally, we can restrict
the variables for which we compute partial derivatives. One might, for example, be
interested in computing the gradient $g(x, \theta_\ell, \hat{y})$ for the weights $\theta_\ell$ from only one
network layer $\ell$. In principle, this also allows for computing uncertainty metrics for
individual convolutional filters or even individual weights, thus also offering a way
of localizing uncertainty within the DNN architecture.
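$$
\begin{aligned}
m^{\theta}_{\|\cdot\|_1}(J) &= \|g(x, \theta, \hat{y})\|_1, &
m^{\theta}_{\|\cdot\|_2}(J) &= \|g(x, \theta, \hat{y})\|_2, &
m^{\theta}_{\max}(J) &= \max(g(x, \theta, \hat{y})), \\
m^{\theta}_{\min}(J) &= \min(g(x, \theta, \hat{y})), &
m^{\theta}_{\mathrm{std}}(J) &= \mathrm{std}(g(x, \theta, \hat{y})), &
m^{\theta}_{\mathrm{mean}}(J) &= \mathrm{mean}(g(x, \theta, \hat{y})).
\end{aligned}
\tag{9}
$$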
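To make the self-referential gradient in (8) and the scalar maps in (9) concrete, the
following minimal sketch evaluates them for a toy linear softmax classifier. The
classifier, the function names `self_gradient` and `gradient_metrics`, and the input
values are illustrative assumptions and not the YOLOv3 setup used in this chapter. The
gradient is written out analytically, with the arg max prediction treated as a constant
label.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_gradient(weights, x):
    """Flattened gradient of the cross-entropy loss w.r.t. the weights of a
    linear softmax classifier, using the arg max prediction as the label."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    # d loss / d W[k][i] = (p_k - 1[k == pred]) * x_i; the label is a constant.
    return [(p - (1.0 if k == pred else 0.0)) * xi
            for k, p in enumerate(probs) for xi in x]


def gradient_metrics(grad):
    """The six scalar maps of (9) applied to a flattened gradient."""
    n = len(grad)
    mean = sum(grad) / n
    std = math.sqrt(sum((g - mean) ** 2 for g in grad) / n)
    return {
        "l1": sum(abs(g) for g in grad),
        "l2": math.sqrt(sum(g * g for g in grad)),
        "max": max(grad),
        "min": min(grad),
        "std": std,
        "mean": mean,
    }


weights = [[2.0, 0.0], [0.0, 2.0]]  # hypothetical two-class model
confident = gradient_metrics(self_gradient(weights, [3.0, 0.0]))
ambiguous = gradient_metrics(self_gradient(weights, [1.5, 1.4]))
print(confident["l2"], ambiguous["l2"])
```

In this toy setting, the input with a clear-cut prediction yields a much smaller
gradient norm than the ambiguous one, in line with the intuition given above.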


Application to instance-based settings: The approach outlined above requires some
adaptation in order to produce meaningful results for settings in which the network
output $\hat{y}(x; \theta)$ and the ground truth $y$ consist of several distinct instances, e.g.,
as in object detection. In such a situation, each component $\hat{y}^j(x; \theta)$ needs to be
assigned individual confidence metrics. If we want to estimate the confidence for
$\hat{y}^j := \hat{y}^j(x; \theta)$, we may regard $\hat{y}^j$ as the ground truth replacement entering the
second slot of $J$. Then, the first argument of $J$ needs to be adjusted according to $\hat{y}^j$.
This is necessary because the loss $J(\hat{y}(x; \theta), \hat{y}^j)$ expresses the mistakes made by all
the output boxes, i.e., the prediction of instances which actually appear in the ground
truth $y$ but which have nothing to do with $\hat{y}^j$ would be punished. The corresponding
gradient would, therefore, misleadingly adjust $\theta$ toward forgetting to see such ground
truth instances. Thus, it is a natural idea to identify $\mathrm{cand}[\hat{y}^j]$ in the network output
and only enter those boxes into $J$ which likely indicate one and the same instance
on $x$. We define the corresponding gradient

$$g^{\mathrm{cand}}(x, \theta, \hat{y}^j) := \nabla_{\theta} J(\mathrm{cand}[\hat{y}^j](x, \theta), \hat{y}^j), \tag{10}$$


which serves as the basis of gradient uncertainty in instance-based settings. The 
flexibility mentioned in the previous paragraph carries over to this definition and 
will be exploited in our experiments. 
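One possible way to select the candidate boxes entering (10) is sketched below. It
assumes that candidates are all output boxes of the same class whose IoU with the box
under consideration reaches a threshold. The threshold value of 0.45, the dictionary
layout, and the function names are hypothetical choices for illustration; the candidate
definition actually used in the chapter may differ.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def candidates(target, outputs, iou_threshold=0.45):
    """Output boxes assumed to indicate the same instance as `target`:
    same class and sufficient overlap (assumed criterion)."""
    return [o for o in outputs
            if o["cls"] == target["cls"]
            and iou(o["box"], target["box"]) >= iou_threshold]


target = {"box": (0.0, 0.0, 10.0, 10.0), "cls": "car"}
outputs = [
    {"box": (1.0, 1.0, 11.0, 11.0), "cls": "car"},         # overlapping, same class
    {"box": (1.0, 1.0, 11.0, 11.0), "cls": "pedestrian"},  # overlapping, other class
    {"box": (20.0, 20.0, 30.0, 30.0), "cls": "car"},       # same class, far away
]
selected = candidates(target, outputs)
print(len(selected))
```

Only the selected boxes would then be passed to the loss in place of the full network
output, so that unrelated instances do not contribute to the gradient.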


4 Experimental Setup 


In this section, we explain our choice of datasets, models, and metrics as well as 
experimental setup and implementation details. 


4.1 Databases, Models, and Metrics 


We investigate three different object detection datasets, namely Pascal VOC
[EVGW+15], MS COCO [LMB+14], and the KITTI vision benchmark [GLSU15]. Pascal VOC
$D^{\mathrm{VOC12}}$ and MS COCO $D^{\mathrm{COCO17}}$ are vision datasets containing images of a
wide range of mostly everyday scenarios with 20 and 80 annotated classes, respectively.
They both possess training and dedicated evaluation splits for testing and constitute
established benchmarks in object detection. In order to test our methods in driving
scenarios, we also investigate the KITTI dataset $D^{\mathrm{KITTI}}$, which shows different
German urban environments and comes with annotated training images. From the annotated
training set, we take 2,000 randomly chosen images for evaluation. An overview of the
sizes of the respective training and evaluation datasets can be found in Table 1.

Table 1 Dataset split sizes for training and testing

Dataset   VOC12     VOC12    COCO17    COCO17   KITTI     KITTI
Split     D_train   D_test   D_train   D_val    D_train   D_eval
Size      14,805    4,952    118,287   5,002    5,481     2,000

The experiments we present here have been carried out based on a YOLOv3
[RF18] re-implementation in PyTorch. We have trained our model from scratch 
on each of the three training splits under dropout with a probability 0.5 between the 
last and the second-to-last convolution layer in each of the regression heads. As meta 
classification and meta regression models, we employ the gradient boosting models 
in [CG16] with standard settings. 

For the evaluation of our meta classification models, we use cross-validated area- 
under-curve metrics (AuROC and AuPR [DG06]) instead of accuracy, since the 
former measure the classification quality independently of any threshold, whereas 
accuracy only reflects the decisions made for the classification probability 0.5. For 
cross-validation, we use 10 individual image-wise splits of uncertainty metrics for 
the training of meta classifiers and the respective complementary split for evalua- 
tion. This particular procedure is employed whenever cross-validation is used in the 
following experiments. Especially on the VOC and KITTI datasets, AuROC values
are mostly over 0.9, so we regard AuPR as a second classification metric. Object
detection quality in meta fusion experiments will, on the one hand, be evaluated in
terms of cross-validated mean average precision (mAP) [EVGW+15]. However, as
the authors of [RF18] recognized, mAP as computed in the VOC challenge is insensitive
to FPs, which is why, on the other hand, we compute the cross-validated mean $F_1$
score (m$F_1$) as well. For the precision-recall curves of each class, we compute
individual $F_1$ scores, which are sensitive to FPs, and we average them over all classes as a
complementary metric to mAP. Confidence calibration is usually evaluated in terms
of expected or maximum calibration error (MCE) [NCH15], computed over bins of
width 0.1. As the expected calibration error is sensitive to the box count per bin, we
instead regard, in addition to the MCE, the average calibration error (ACE) [NZV18].
Meta regression quality is evaluated by the usual coefficient of determination $R^2$.
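The three calibration errors mentioned above differ only in how the per-bin gaps between
accuracy and mean confidence are aggregated. The sketch below illustrates this with
equally wide bins of width 0.1; the function name `calibration_errors` and the toy data
are assumptions made for illustration.

```python
def calibration_errors(confidences, labels, n_bins=10):
    """Expected (ECE), maximum (MCE), and average (ACE) calibration error.

    `labels` are 1 for true positives and 0 for false positives."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((c, y))

    gaps, counts = [], []
    for members in bins:
        if not members:
            continue  # empty bins are skipped in all three errors
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        gaps.append(abs(accuracy - mean_conf))
        counts.append(len(members))

    total = sum(counts)
    return {
        "ECE": sum(n * g for n, g in zip(counts, gaps)) / total,  # weighted by bin count
        "MCE": max(gaps),                                         # worst bin
        "ACE": sum(gaps) / len(gaps),                             # unweighted bin average
    }


# Eight low-confidence false positives and two high-confidence boxes, one of them wrong.
confidences = [0.05] * 8 + [0.95] * 2
labels = [0] * 8 + [1, 0]
errors = calibration_errors(confidences, labels)
print(errors)
```

In the toy data, the sparsely populated high-confidence bin dominates the MCE and ACE,
while the ECE is pulled toward the well-calibrated but heavily populated low-confidence
bin. This is the sensitivity to the box count per bin referred to above.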


4.2 Implementation Details 


Our object detection models receive a dropout layer between the second-to-last and 
the last layers in each of the three YOLOv3 detection heads. Dropout is active during 
training with a rate of 0.5 and also during MC inference, where we take standard 
deviations over 30 dropout samples for each of the 4 + 1 + Nc instance features of 
all output boxes. The MetaDetect metrics introduced in Sect. 3.3 are computed from
a score threshold of $\varepsilon_s = 0.0$, as it has been found in [SKR20] that lower thresholds
lead to better performance in meta classification and meta regression.

Gradient uncertainty metrics were computed for the weights in the second-to-last
(“$T-1$”) and the last (“$T$”) convolutional layers in the YOLOv3 architecture at the
same candidate score threshold as for the MetaDetect metrics. As there are three
detection heads, corresponding to a $76 \times 76$ (“S”), a $38 \times 38$ (“M”), and a $19 \times 19$
(“L”) cell grid, we also compute gradient uncertainty metrics individually for each
detection head. We use this distinction in our notation and, for example, indicate the
set of parameters from the last layer ($T$) of the detection head producing the $76 \times 76$
cell grid (S) by $\theta(T, S)$. Moreover, as indicated in Sect. 3.4, we also exploit the split
of the loss function in (5). Each of the computed $2 \times 3 \times 3 = 18$ gradients per box
(two layers, three detection heads, three loss terms) results in the 6 uncertainty metrics
presented in (9), giving a total of $18 \times 6 = 108$ gradient uncertainty metrics per
bounding box. Due to the resulting computational expense, we only compute gradient
metrics for output boxes with score values $\hat{s} > 10^{-4}$ and
regard only those boxes for all of the following experiments.
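The count of 108 metrics per box follows from a plain Cartesian product, which the
following sketch enumerates. The names of the three loss terms and the naming scheme of
the metrics are hypothetical placeholders; only the layer labels, the head labels, and
the six maps from (9) are taken from the text.

```python
from itertools import product

layers = ["T-1", "T"]                             # second-to-last and last layer
heads = ["S", "M", "L"]                           # 76x76, 38x38, and 19x19 cell grids
loss_terms = ["localization", "score", "class"]   # assumed names for the terms of (5)
maps = ["l1", "l2", "max", "min", "std", "mean"]  # scalar maps from (9)

# One gradient per (layer, head, loss term) combination.
gradients = list(product(layers, heads, loss_terms))

# Six scalar metrics per gradient.
metric_names = [
    f"{m}|{term}|theta({layer},{head})"
    for (layer, head, term) in gradients
    for m in maps
]

print(len(gradients), len(metric_names))
```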


4.3 Experimental Setup and Results 


Starting from pre-trained models, we compute the DNN output, as well as MC 
dropout, MetaDetect, and gradient uncertainty metrics on each evaluation dataset (see 
Table 1). For each output box, we also compute the maximum IoU with the respective
ground truth such that the underlying relation can be leveraged. Before fitting any 
model, we perform NMS on the output boxes which constitutes the relevant case 
for our meta fusion investigation. In the latter, we are primarily interested in finding 
true boxes which are usually discarded due to their low assigned score. We compare 
the performance of different sets of uncertainty metrics in different disciplines. In 
the following, “Score” denotes the network’s confidence score, “MC” stands for MC 
dropout, and “MD” the MetaDetect uncertainty metrics as discussed in Sect. 3.3. 
Moreover, “G” denotes the full set of gradient uncertainty metrics from Sect. 3.4 
and the labels “MD+MC”, “G+MC”, “MD+G”, and “MD+G+MC” stand for the 
respective unions of MC dropout, MetaDetect, and gradient uncertainty metrics. 


Meta classification: We create 10 splits of the dataset of uncertainty metrics and 
computed IoU by randomly dividing it 10 times in half. We assign the TP label to
those examples which have $\mathit{IoU} \geq 0.5$ and the FP label otherwise. On each split, we fit
a gradient boosting classifier to the respective set of uncertainty metrics (see Fig. 5) on 
one half of the data. The resulting model is used to predict a classification probability 
on the uncertainty metrics of the second half of the data and vice-versa. This results 
in meta classification predictions on all data points. We evaluate these in terms of 
the previously introduced area-under-curve metrics under usage of the TP/FP labels 
generated from the ground truth, where we obtain averages and standard deviations 
from the 10 splits shown in Fig. 5. Performance measured in either of the two
area-under-curve metrics suggests that the models situated in the top-right separate TP
from FP best. In all datasets, the meta classifiers MD, G, and combinations thereof
outperform the network confidence by a large margin (upwards of 6 AuROC percent
points (ppts) on $D^{\mathrm{VOC12}}_{\mathrm{test}}$, 2.5 on $D^{\mathrm{COCO17}}_{\mathrm{val}}$, and 1.5 on
$D^{\mathrm{KITTI}}_{\mathrm{eval}}$), with MC dropout being among the weakest meta classifiers. Note that
for both $D^{\mathrm{VOC12}}_{\mathrm{test}}$ (top) and $D^{\mathrm{COCO17}}_{\mathrm{val}}$ (center), the standard
deviation bars are occluded by the markers due to the size of the range. On
$D^{\mathrm{KITTI}}_{\mathrm{eval}}$, we can see some of the error bars and the models forming
a strict hierarchy across the two metrics with overlapping MD+MC and G+MC.
Overall, the three sources of uncertainty MC, MD, and G show a significant degree
of non-redundancy across the three datasets with mutual boosts of up to 7 AuPR ppts
on $D^{\mathrm{VOC12}}_{\mathrm{test}}$, 4.5 on $D^{\mathrm{COCO17}}_{\mathrm{val}}$, and 1.1 on $D^{\mathrm{KITTI}}_{\mathrm{eval}}$.
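The evaluation protocol described above (ten random image-wise splits in half, training
on one half and predicting on the other and vice versa, then scoring all predictions)
can be sketched as follows. The meta classifier is passed in as a pair of `fit` and
`predict` callables; the midpoint model used in the demonstration is a deliberately
trivial stand-in for the gradient boosting classifier of [CG16], and all function names
and the toy data are assumptions.

```python
import random


def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_half_splits(image_ids, n_repeats=10, seed=0):
    """Yield index pairs (a, b) from random image-wise splits in half."""
    rng = random.Random(seed)
    unique = sorted(set(image_ids))
    for _ in range(n_repeats):
        shuffled = unique[:]
        rng.shuffle(shuffled)
        first = set(shuffled[: len(shuffled) // 2])
        a = [i for i, img in enumerate(image_ids) if img in first]
        b = [i for i, img in enumerate(image_ids) if img not in first]
        yield a, b


def cross_validated_auroc(features, labels, image_ids, fit, predict):
    """Mean AuROC and per-split values over the repeated half splits."""
    results = []
    for a, b in repeated_half_splits(image_ids):
        preds = [None] * len(labels)
        for train, test in ((a, b), (b, a)):  # fit on one half, predict the other
            model = fit([features[i] for i in train], [labels[i] for i in train])
            for i in test:
                preds[i] = predict(model, features[i])
        results.append(auroc(preds, labels))
    return sum(results) / len(results), results


def fit_midpoint(train_x, train_y):
    """Stand-in model: the mean of the first feature (labels are unused)."""
    return sum(x[0] for x in train_x) / len(train_x)


def predict_offset(model, x):
    return x[0] - model


# Toy data: two boxes per image, one well-separated uncertainty feature per box.
features = [[0.01 * i] for i in range(20)] + [[10.0 + 0.01 * i] for i in range(20)]
labels = [0] * 20 + [1] * 20
image_ids = [i // 2 for i in range(40)]
mean_auroc, per_split = cross_validated_auroc(
    features, labels, image_ids, fit_midpoint, predict_offset)
print(mean_auroc)
```

Splitting by image rather than by box keeps all boxes of one image in the same half,
which avoids leaking information between strongly correlated boxes.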


Meta fusion: We use the 10 datasets of uncertainty metrics and resulting confidences
as in meta classification in combination with the bounding box information
as new post-NMS output of the network. Thresholding the respective confidence
at decision thresholds $\varepsilon_{\mathrm{dec}} \in (0, 1)$ with a fixed step width of 0.01 reproduces the
baseline object detection pipeline in the case of the score. The network prediction
with assigned confidence can then be evaluated in terms of the aforementioned
class-averaged performance measures for which we obtain averages and standard
deviations from the 10 splits, except for the deterministic score baseline. In our
comparison, we focus on the score baseline, MC dropout baseline, output- and
gradient-based uncertainty metrics (MD and G), and the full model based on all available
uncertainty metrics (G+MD+MC). We show the resulting curves of mAP over m$F_1$
in Fig. 6, where we see that the maximum m$F_1$ value is attained by the
confidence score. In addition, we observe a sharp bend in the score curve resulting
from a large count of samples with a score between 0 and 0.01, where the interpolation
to the threshold $\varepsilon_s = 0$ reaches 0.912 mAP at a low 0.61 m$F_1$ (outside the
plotting range). In terms of mAP, however, the score is surpassed by the meta fusion
models G and in particular the meta classification models involving MD (maximum
mAP close to 0.92 at around 0.81 m$F_1$). The raw MetaDetect confidences seem
to slightly improve over G+MD+MC in terms of mAP by about 0.3 ppts, which
may be due to overfitting of the gradient boosting model, but could also result from
the altered confidence ranking which is initially performed when computing mean
average precision. Meta fusion based on MC dropout uncertainty shows the worst
performance of all four methods considered in this test.

Fig. 6 Left: mAP over m$F_1$ score for five meta fusion models computed from 10-fold
cross-validation. We show standard deviation error bars obtained from cross-validation. Right: meta
regression $R^2$ for uncertainty models on $D^{\mathrm{VOC12}}_{\mathrm{test}}$, $D^{\mathrm{COCO17}}_{\mathrm{val}}$, and
$D^{\mathrm{KITTI}}_{\mathrm{eval}}$. The plot shows the averages
over 10-fold cross-validation with standard deviation error bars hidden by the symbols

Fig. 7 Average calibration error (ACE) and maximum calibration error (MCE) of the score and
meta classification confidences for $D^{\mathrm{VOC12}}_{\mathrm{test}}$, $D^{\mathrm{COCO17}}_{\mathrm{val}}$, and
$D^{\mathrm{KITTI}}_{\mathrm{eval}}$ (from left to right)
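The threshold sweep used in meta fusion can be illustrated for a single class as
follows. The sketch assumes that each box already carries a TP/FP label and that the
number of ground truth objects is known; it computes the $F_1$ score at every decision
threshold on a grid of step width 0.01. The function name `f1_sweep` and the toy values
are hypothetical, and the class averaging and the mAP computation are omitted.

```python
def f1_sweep(confidences, is_tp, n_ground_truth, step=0.01):
    """F1 score of the boxes kept at each decision threshold in (0, 1)."""
    curve = []
    n_steps = round(1 / step)
    for k in range(1, n_steps):
        threshold = k * step
        kept = [tp for c, tp in zip(confidences, is_tp) if c >= threshold]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_ground_truth - tp  # ground truth objects without a kept TP box
        denom = 2 * tp + fp + fn
        curve.append((threshold, 2 * tp / denom if denom else 0.0))
    return curve


# Four post-NMS boxes of one class with (meta classification) confidences.
confidences = [0.95, 0.85, 0.305, 0.155]
is_tp = [1, 1, 0, 1]
curve = f1_sweep(confidences, is_tp, n_ground_truth=3)
best_threshold, best_f1 = max(curve, key=lambda point: point[1])
print(best_threshold, best_f1)
```

Replacing the score by a meta classification confidence only changes the values in
`confidences`; the boxes and the sweep itself stay the same, which is why the score
baseline is reproduced when the score is used.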


Calibration: From the generated meta classification probabilities and the score,
calibration errors can be computed, indicating their calibration as binary classifiers, and
compared; see Fig. 7 with standard deviations indicated by error bars. In addition, we
show the Platt-scaled confidence score as a reference for a post-processing calibration
method. Naturally, ACE is smaller than MCE. The meta classification models
show consistently far smaller calibration errors (maximum MCE of 0.11 on
$D^{\mathrm{KITTI}}_{\mathrm{eval}}$) than the confidence score (minimum MCE of 0.1), with comparatively
small miscalibration on the COCO validation dataset (center). We also observe that
meta classifiers are better calibrated than the Platt-scaled confidence score, with
the smallest margin on $D^{\mathrm{KITTI}}_{\mathrm{eval}}$. Note from the MCE that meta classification models
show generally good calibration with a maximum ACE around 0.05, as indicated in
Sect. 3 and by the example of Fig. 3.


Meta regression: By an analogous approach to meta classification, we can fit gradient
boosting regression models in a cross-validated fashion to sets of uncertainty metrics
and the true maximum IoU with the ground truth. The resulting predicted IoU values
can be evaluated in terms of $R^2$, which is illustrated in Fig. 6. Note once
again that error bars for the standard deviation are covered by the markers. We observe
an overall similar trend of the metrics to the meta classification performance in that
combined models tend to perform better (boosts of up to 6 $R^2$ ppts), indicating mutual
non-redundancy among the different sources of uncertainty. The models based on
MD and G individually show significant improvements over the confidence score
(over 5 $R^2$ ppts on all datasets) and a slight improvement over MC (up to 3 $R^2$ ppts)
for all investigated datasets.
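For reference, the coefficient of determination used to score the predicted IoU values
can be computed as in the following short sketch; the function name and the toy values
are assumptions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


true_iou = [0.0, 0.5, 1.0]
predicted_iou = [0.1, 0.5, 0.9]
print(r_squared(true_iou, predicted_iou))
```

A model that always predicts the mean IoU obtains $R^2 = 0$, so the reported values
measure the fraction of IoU variance explained by the uncertainty metrics.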


Parameter importance: We investigate which parameters are most important for
meta classification in order to find the best performance with as few metrics as
possible. We pursue a greedy approach and first evaluate single-metric models in
terms of AuROC and AuPR in the cross-validated manner used before. We then fix the
best-performing single metric, investigate two-metric combinations, and continue in this
fashion, gathering the resulting metric ranking with the respective area-under-curve
scores in Table 2. In addition
to the nested models, we also show the model using all metrics, which is also depicted
in Fig. 5. Note that the numbers represent cross-validated averages, where we choose
not to show standard deviations as we are mainly interested in seeing which metrics
combine to yield well-separating meta classifiers. On all datasets, we find that a total
of 9 metrics suffices to come close to or reach the performance of the best model in
question. The network confidence appears as the most informative single metric 5
out of 6 times. Another metric that leads to early good improvements is the related
MC dropout standard deviation of the confidence score. Other noteworthy metrics
are the left border coordinate of the box and gradient metrics for the class probability
contribution of the loss. Overall, gradient metrics contribute a large amount of information
in addition to $\hat{s}$ and $\mathrm{std}_{\mathrm{MC}}(\hat{s})$, which is indicative of meta classification performance.

Table 2 Cumulative parameter ranking in terms of the importance for AuROC and AuPR obtained
from a greedy search for the first 9 parameters. Metrics indicated are averages obtained from
cross-validation. We show the performance of the models with all metrics as co-variables for
reference on each of the datasets VOC12, COCO17, and KITTI
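The greedy search described above amounts to forward selection: in each round, the
metric that improves the current nested model the most is added. The sketch below
assumes an `evaluate` callable that returns the cross-validated AuROC (or AuPR) of a
meta classifier trained on a given list of metrics. Here it is replaced by a toy
additive score, so the metric names and numbers are purely illustrative.

```python
def greedy_selection(metric_names, evaluate, n_select=9):
    """Forward selection: repeatedly add the metric with the best resulting score."""
    selected, history = [], []
    remaining = list(metric_names)
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda m: evaluate(selected + [m]))
        selected.append(best)
        remaining.remove(best)
        history.append((best, evaluate(selected)))
    return history


# Toy stand-in for the cross-validated AuROC of a meta classifier (hypothetical values).
gains = {"score": 0.90, "grad_l2_class": 0.04, "std_mc_score": 0.03, "box_width": 0.01}


def toy_auroc(subset):
    return sum(gains[m] for m in subset)


history = greedy_selection(list(gains), toy_auroc, n_select=3)
for name, value in history:
    print(name, round(value, 2))
```

With a real meta classifier, each call to `evaluate` involves a full cross-validation
run, so the search requires on the order of the number of metrics times `n_select` model
fits rather than one fit per subset.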


Conclusions and Outlook 


In this chapter, we have outlined two largely orthogonal approaches to quantifying 
epistemic uncertainty in deep object detection networks, the output-based MetaDe- 
tect approach, and gradient-based uncertainty metrics. Both compare favorably to the 
network confidence score and MC dropout in terms of meta classification and meta 
regression with the additional benefit that meta classifiers give rise to well-calibrated 
confidences irrespective of the set of metrics they are based on. Moreover, different 
sources of uncertainty information lead to additional boosts when combined, indicat- 
ing mutual non-redundancy. We have seen that the meta fusion approach enables us to 
trade uncertainty information for network performance in terms of mean average pre- 
cision with the best meta fusion models involving the MetaDetect metrics. In terms 
of raw meta classification capabilities, well-performing models can be achieved by 
only regarding a small fraction of all metrics, with gradient-based uncertainty met- 
rics contributing highly informative components. The design of well-performing 
meta classifiers and meta regression models opens up possibilities for further appli- 
cations. On the one hand, the implementation of meta classification models into an 
active learning cycle could potentially lead to a drastic decrease in data annotation 
costs. On the other hand, meta classification models may be applicable beneficially 
in determining the labeling quality of datasets and detecting labeling mistakes. 
Detecting and Learning the Unknown
in Semantic Segmentation


Robin Chan, Svenja Uhlemeyer, Matthias Rottmann, and Hanno Gottschalk 


Abstract Semantic segmentation is a crucial component for perception in auto- 
mated driving. Deep neural networks (DNNs) are commonly used for this task, and 
they are usually trained on a closed set of object classes appearing in a closed
operational domain. However, this contrasts with the open-world setting of automated
driving in which DNNs are deployed. Therefore, DNNs necessarily face data that they
have never encountered previously, also known as anomalies, and coping with them
properly is extremely safety-critical. In this chapter, we first give an overview of
anomalies from an information-theoretic perspective. Next, we review research in
detecting unknown objects in semantic segmentation. We present a method outper- 
forming recent approaches by training for high entropy responses on anomalous 
objects, which is in line with our theoretical findings. Finally, we propose a method 
to assess the occurrence frequency of anomalies in order to select anomaly types to 
include into a model’s set of semantic categories. We demonstrate that those anoma- 
lies can then be learned in an unsupervised fashion which is particularly suitable in 
online applications. 


1 Introduction 


Recent developments in deep learning have enabled scientists and practitioners to 
advance in a broad field of applications that were intractable before. To this end, 
deep neural networks (DNNs) are mostly employed which are usually trained in a 
supervised fashion under a closed-world assumption. However, when those algorithms
are deployed to real-world applications, e.g., artificial intelligence (AI) systems used
for perception in automated driving, they often operate in an open-world setting
where they have to face the diversity of the real world. Consequently, DNNs are likely
exposed to data which is “unknown” to them and therefore possibly beyond their
capabilities to process. For this reason, having methods at hand that indicate when
a DNN is operating outside of its learned domain, in order to seek human intervention, is
of utmost importance in safety-critical applications.

A generic term for such a task is anomaly detection, which is generally defined as 
recognizing when something departs from what is regarded as normal or common. 
More precisely, identifying anomalous examples during inference, i.e., new examples 
that are “extreme” in some sense as they lie in low density regimes or even outside of 
the training data distribution, is commonly referred to as out-of-distribution (OoD) or 
novelty detection in deep learning terminology. The latter is in close connection to the 
task of identifying anomalous examples in training data, which is contrarily known 
as outlier detection; a term originating from classical statistics to determine whether 
observational data is polluted. Those outlined notions are often used interchangeably 
in deep learning literature. Throughout this chapter, we will stick to the general term 
anomaly and only specify when distinguishing is relevant. 

For the purpose of anomaly detection, plenty of methods, ranging from classical
statistical ones (see Sect. 2) to deep-learning-specific ones (see Sect. 4), have been
developed in the past. Nowadays, for the most challenging computer vision tasks
tackled by deep learning, where both the model weights and output are of high
dimension (in the millions), specific approaches to anomaly detection are mandatory.

Classical methods such as density estimation fail due to the curse of dimension- 
ality. Early approaches identify outliers based on the distance to their neighbors 
[KNT00, RRS00], i.e., they are looking for sparse neighborhoods. Other methods
consider relative densities to handle clusters of different densities, e.g., by
comparing one instance either to its $k$-nearest neighbors [BKNS00] or using an
$\varepsilon$-neighborhood as reference set [PKGF03]. However, the concept of neighborhoods
becomes meaningless in high dimensions [AHK01]. More advanced approaches for
high-dimensional data compute outlier degrees based on angles instead of distances
[KSZ08] or even identify lower-dimensional subspaces [AY01, KKSZ09].

In deep-learning-driven computer vision applications, novelties are typically 
regarded as more relevant than outliers. In semantic segmentation, i.e., pixel-level 
image classification, novelty detection may even refer to a number of sub-tasks. On 
the one hand, we might be concerned with the detection of semantically anomalous 
objects. This is also known as anomaly segmentation in the case of semantic segmen- 
tation. On the other hand, we also might be concerned with the detection of changed 
environmental conditions that are novel. The latter may be effects of a domain shift 
and include change in weather, time of day, seasonality, location and time. In this 
chapter, we focus only on semantically novel objects as anomalies. 

In general, an important capability of AI systems is to identify the unknown. How- 
ever, when striving for improved self-reflection capabilities, anomaly detection is not 
sufficient. Another important capability for real-world deployment of AI systems is 
to realize that some specific concept appears over and over again and potentially
constitutes a new (or novel) object class. Incremental learning refers to the task 
of learning new classes, however, especially in semantic segmentation, mostly in a 
strictly supervised or semi-supervised fashion where data for the new class is labeled 
with ground truth [MZ19, CMB+20]. This is accompanied by an enormous data col- 
lection and annotation effort. In contrast to supervised incremental learning, humans 
may recognize a novelty of a given class that appears over and over again very well, 
such that in the end a single feedback might be sufficient to assign a name to a novel 
class. For the task of image classification, [HZ21] provides an unsupervised exten- 
sion of the semantic space, while for segmentation there exist only approaches for 
supervised extension of the semantic space via incremental learning. 

In this chapter, we first introduce anomaly detection from an information-based 
perspective in Sect. 2. We provide theoretical evidence that the entropy is a suitable 
quantity for anomaly detection, particularly in semantic segmentation. In Sect. 3, 
we review recent developments in the fields of anomaly detection and unsupervised 
learning of new classes. We give an overview on existing methods, both in the con- 
text of image classification and semantic segmentation. In this setting, we present an 
approach to train semantic segmentation DNNs for high entropy on anomaly data 
in Sect. 4. We compare our proposed approach against other established and recent
state-of-the-art anomaly segmentation methods and empirically show the effectiveness
of entropy maximization in identifying unknown objects. Lastly, we propose an
unsupervised learning technique for novel object classes in Sect. 5. Further, we
provide an outlook on how the latter approach can be combined with entropy maximization
to handle the unknown at run time in automated driving. 


2 Anomaly Detection Using Information and Entropy 


Anomaly detection is a common routine in any data analysis task. Before training a 
statistical model on data, the data should be investigated whether the underlying dis- 
tribution generating the data is polluted by anomalies. In this context, anomalies can 
generally be understood as samples that do not fit into a distribution. Such anoma- 
lous samples can, e.g., be generated in the data recording process either by extreme 
observations, by errors in recording and transmission, or by the fusion of datasets 
that use different systems of units. Most common for the detection of anomalies in 
statistics is the inspection of maximum and minimum values for each feature, or 
simple univariate visualization via box-whisker plots or histograms. 

More sophisticated techniques are applied in multivariate anomaly detection. 
Here, anomalous samples do not necessarily have to contain extreme values for 
single features, but rather an untypical combination of them. One of the application 
areas for multivariate anomaly detection is, e.g., fraud detection. 

In both outlined cases, an anomaly $z \in \mathbb{R}^d$ can be qualified as an observation that
occurs at a location of extremely low density of the underlying distribution $p(z)$ or,
equivalently, has an exceptionally high value of the information

$$I(z) = -\log p(z). \tag{1}$$


Here, two problems occur: First, it is generally not specified what is considered as
exceptionally high. Second, $p(z)$ and thereby $I(z)$ are generally unknown. Regarding
the latter issue, however, the estimate $\hat{I}(z) = -\log \hat{p}(z)$ can be used, which in turn
relies on estimating $\hat{p}(z)$ from data associated with the probability density function
$p(z)$. Estimation approaches for $\hat{p}(z)$ can be divided into parametric and
non-parametric ones.

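The Mahalanobis distance [Mah36] is the best known parametric method for
anomaly detection, which is based on the information of the multivariate normal
distribution $\mathcal{N}$. In fact, if $z \sim \mathcal{N}(\mu, \Sigma)$ with mean $\mu \in \mathbb{R}^d$ and positive
definite covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$, then

$$I(z) = -\log\left( \frac{1}{(2\pi)^{d/2} (\det \Sigma)^{1/2}} \exp\left( -\frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu) \right) \right) \tag{2}$$

$$= \frac{d}{2} \log(2\pi) + \frac{1}{2} \log(\det \Sigma) + \frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu) = \frac{1}{2} d_{\Sigma}(z, \mu)^2 + c, \tag{3}$$

where

$$d_{\Sigma}(z, \mu) := \sqrt{(z - \mu)^T \Sigma^{-1} (z - \mu)} \tag{4}$$

denotes the Mahalanobis distance. The estimate $\hat{I}(z)$ is obtained by replacing $\mu$
and $\Sigma$ by the arithmetic mean $\hat{\mu}$ and the empirical covariance matrix $\hat{\Sigma}$, respectively,
and likewise $d_{\Sigma}(z, \mu)$ by the empirical Mahalanobis distance $d_{\hat{\Sigma}}(z, \hat{\mu})$.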
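As a small worked example of (2)-(4), the sketch below estimates the mean and the
empirical covariance from two-dimensional toy data and evaluates the empirical
Mahalanobis distance and the estimated information for two test points. It is restricted
to $d = 2$ so that the covariance matrix can be inverted in closed form; the data and the
function names are assumptions.

```python
import math


def fit_gaussian_2d(points):
    """Arithmetic mean and empirical covariance (s_xx, s_xy, s_yy) of 2D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(z, mean, cov):
    """Empirical Mahalanobis distance, see (4), using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)


def information_2d(z, mean, cov):
    """Estimated information, see (3), for d = 2."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    d = mahalanobis_2d(z, mean, cov)
    return math.log(2.0 * math.pi) + 0.5 * math.log(det) + 0.5 * d * d


# Strongly correlated toy data: both features grow together.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.2), (3.0, 2.9), (4.0, 4.1), (1.0, 0.8), (3.0, 3.3)]
mean, cov = fit_gaussian_2d(data)

typical = (2.0, 2.0)
untypical = (3.5, 0.5)  # each coordinate is within the data range, the combination is not
d_typical = mahalanobis_2d(typical, mean, cov)
d_untypical = mahalanobis_2d(untypical, mean, cov)
print(d_typical, d_untypical)
```

The second test point has no extreme value in either coordinate but still receives a
large distance, which is the multivariate effect described earlier in this section.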

In contrast, non-parametric techniques of anomaly detection rely on non-parametric
estimates of $p(z)$. Here, a large variety of methods exists, from histograms
and kernel estimators to many others [Kle09]. We note, however, that the
non-parametric estimation of densities and information generally suffers from the
curse of dimensionality. To alleviate the latter issue in anomaly detection, estimation
of information is often combined with techniques of dimensionality reduction, such
as principal component analysis [HTF07] or autoencoders [SY14].

When using non-linear dimensionality reduction with autoencoders, densities 
obtained in the latent space depend on the encoder and not only on the data itself. 
This points towards a general problem in anomaly detection. If $p(z)$ is the density
of a random quantity $z$ and $z' = \phi(z)$ is an equivalent encoding of the data $z$ using a
bijective and differentiable mapping $\phi : \mathbb{R}^d \to \mathbb{R}^d$, the change of variables formula
[Rud87, AGLR21]

$$p(z') = p(z) \cdot |\det(\nabla_{z'} z)| = p(z) \cdot |\det(\nabla_{z'} \phi^{-1}(z'))| \tag{5}$$

implies that the information of $z'$ is

$$I(z') = -\log(p(z)) - \log(|\det(\nabla_{z'} \phi^{-1}(z'))|), \tag{6}$$
where $\nabla_{z'} \phi^{-1}(z')$ denotes the Jacobian matrix of the inverse function $\phi^{-1}$. Thus,
whenever a high value of $I(z)$ indicates an anomaly, there always exists another
equivalent representation of the data $z'$, where the information $I(z')$ is low. In other
words, if $z$ is remote from other instances $z_i$ of a dataset and therefore considered
an anomaly, there will be a transformation $z' = \phi(z)$ that brings $z'$ right into the
center of the data $z'_i = \phi(z_i)$. In fact, via the Rosenblatt transformation [Ros52] any
representation $z$ of the data can be expressed via a representation $z' = \phi(z)$ where
$I(z')$ is constant over all data points. This stresses the importance of understanding that
an anomaly always refers to both the probability and the encoding of the data $z$. This is
true for both the original data and its approximated lower-dimensional representation.
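The effect of the Rosenblatt transformation can be checked numerically in one dimension,
where it reduces to applying the cumulative distribution function. The sketch below
assumes standard normal data and uses $\phi$ equal to the normal CDF, which maps
$\mathbb{R}$ onto $(0, 1)$ rather than onto $\mathbb{R}$. The function names are
illustrative.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def information_original(z):
    """I(z) = -log p(z) for standard normal data, see (1)."""
    return -math.log(normal_pdf(z))


def information_transformed(z):
    """I(z') for z' = phi(z) with phi the normal CDF, computed via (6)."""
    # By the inverse function theorem, phi^{-1} has derivative 1 / p(z) at z' = phi(z).
    log_abs_det = -math.log(normal_pdf(z))
    return -math.log(normal_pdf(z)) - log_abs_det


for z in (0.0, 2.0, 4.0):
    print(z, normal_cdf(z), information_original(z), information_transformed(z))
```

The point $z = 4$ carries high information in the original representation, whereas in
the transformed representation every point has information zero, so no threshold on
$I(z')$ can single it out.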

As a side remark, autoencoders designed from neural networks have been very 
successfully applied in anomaly detection. Encoder and decoder networks possess the 
universal approximation property [Cyb89]. Furthermore, common training losses like 
the reconstruction error are invariant under a change of the representation on latent 
spaces. Therefore, additional insights seem to be required to explain the empirical 
success of anomaly detection with autoencoders which is, however, not the scope of 
this chapter. 

Another way of looking at the issue of anomaly detection in the context of different 
representations of the same data is an explicit choice of a reference measure. This 
reference measure represents to which extent, or how likely, data is contaminated 
by potential anomalies. Suppose we can associate the probability density $p^{\mathrm{anom}}(z)$ 
with the reference measure. Then we can base anomaly detection on the quotient of 
densities, i.e., the odds $p(z)/p^{\mathrm{anom}}(z)$, and apply a threshold, flagging an anomaly 
whenever this ratio is low or, equivalently, when the relative information 

$$I^{\mathrm{rel}}(z) := -\log\left(\frac{p(z)}{p^{\mathrm{anom}}(z)}\right) = I(z) - I^{\mathrm{anom}}(z) \qquad (7)$$

is high. We note that the relative information is invariant under changes of the 
representation $z' = \phi(z)$, as the $-\log\left|\det\left(\nabla_{z'} \phi^{-1}(z')\right)\right|$ term from (6) occurs once 
with positive sign in $I(z')$ and once with negative sign in $-I^{\mathrm{anom}}(z')$ and therefore 
cancels. Thus, the choice of a reference measure and the choice of a representation 
for the data are largely equivalent. 
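
The cancellation of the Jacobian term can be illustrated with a small numerical sketch. It assumes two one-dimensional Gaussian densities for the proper data and the reference measure, and a hypothetical bijection $\phi(z) = z^3 + z$; none of these choices stem from the chapter.

```python
import numpy as np


def gaussian_pdf(z, mean, std):
    """Density of a one-dimensional Gaussian."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def phi_prime(z):
    """Derivative of the hypothetical bijection phi(z) = z**3 + z."""
    return 3.0 * z ** 2 + 1.0


z = np.linspace(-3.0, 3.0, 7)
p = gaussian_pdf(z, mean=0.0, std=1.0)        # density of the proper data
p_anom = gaussian_pdf(z, mean=0.0, std=3.0)   # assumed reference density

# Densities of z' = phi(z) obtained from the change of variables formula.
p_prime = p / np.abs(phi_prime(z))
p_anom_prime = p_anom / np.abs(phi_prime(z))

info = -np.log(p)
info_prime = -np.log(p_prime)                     # depends on the encoding
rel_info = -np.log(p / p_anom)
rel_info_prime = -np.log(p_prime / p_anom_prime)  # Jacobian term cancels
```

The plain information changes with the encoding, whereas the relative information is identical in both representations.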

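In practical situations, $p^{\mathrm{anom}}(z)$ is often represented by some data $\{z_j^{\mathrm{anom}}\}_{j \in \mathcal{J}'}$ that 
are either simulated or drawn from some data source of known anomalies. A binary 
classifier $p(\mathrm{anom}|z)$ can then be trained on the basis of proper data $\{z_j\}_{j \in \mathcal{J}}$ and anomalous 
data $\{z_j^{\mathrm{anom}}\}_{j \in \mathcal{J}'}$. The assumed prior probability $p(\mathrm{anom})$ for anomalies, i.e., the 
degree of contamination, acts as a threshold for the estimated odds. Equivalently, the 
estimate of the relative information is 

$$\begin{aligned}
\hat{I}^{\mathrm{rel}}(z) &= -\log\left(\frac{p(z)}{p^{\mathrm{anom}}(z)}\right) \overset{\text{Bayes' Theorem}}{=} -\log\left(\frac{p(\text{non-anom}|z)\, p(z)\, p(\mathrm{anom})}{p(\text{non-anom})\, p(\mathrm{anom}|z)\, p(z)}\right) && (8) \\
&= -\log\left(\frac{1 - p(\mathrm{anom}|z)}{p(\mathrm{anom}|z)} \cdot \frac{p(\mathrm{anom})}{1 - p(\mathrm{anom})}\right) && (9) \\
&= -\log\left(\frac{1 - p(\mathrm{anom}|z)}{p(\mathrm{anom}|z)}\right) - \log\left(\frac{p(\mathrm{anom})}{1 - p(\mathrm{anom})}\right) && (10) \\
&= -\log\left(\frac{1 - p(\mathrm{anom}|z)}{p(\mathrm{anom}|z)}\right) + c, && (11)
\end{aligned}$$

with the prior log-odds $c = -\log\left(\frac{p(\mathrm{anom})}{1 - p(\mathrm{anom})}\right)$ being a parameter controlling the 
threshold for the binary classifier $p(\mathrm{anom}|z)$. 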
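
As a minimal sketch of how the classifier output translates into the relative information, the snippet below evaluates the classifier log-odds plus the prior log-odds $c$. The function name and the example probabilities are assumptions made for illustration only.

```python
import numpy as np


def relative_information(p_anom_given_z, prior_anom):
    """Estimate of the relative information: classifier log-odds plus prior log-odds c."""
    p = np.asarray(p_anom_given_z, dtype=float)
    c = -np.log(prior_anom / (1.0 - prior_anom))
    return np.log(p / (1.0 - p)) + c


# Hypothetical classifier outputs p(anom|z) for three instances.
scores = relative_information([0.01, 0.05, 0.9], prior_anom=0.05)
```

The estimate is positive exactly when the classifier assigns a higher anomaly probability than the assumed prior, which is how the prior acts as a threshold.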

When specifying what is an exceptionally high value for the information $I(z)$ or 
relative information $I^{\mathrm{rel}}(z)$, the distinction between the detection of outliers in the 
training data and the detection of novelties during inference has to be taken into 
account. In outlier detection, observations which have high (relative) information 
but which are in agreement with the extreme value of the (relative) information 

$$I^{\max} = \max_{j \in \mathcal{J}} I(z_j) \quad \text{or} \quad I^{\mathrm{rel},\max} = \max_{j \in \mathcal{J}} I^{\mathrm{rel}}(z_j), \qquad (12)$$

are usually intentionally not eliminated. An outlier $z$ for the level of significance 
$0 < \alpha < 1$ can then be detected using the condition 

$$P_{I^{\max}}\left(I^{\max} > I(z)\right) \le \alpha \quad \text{or} \quad P_{I^{\mathrm{rel},\max}}\left(I^{\mathrm{rel},\max} > I^{\mathrm{rel}}(z)\right) \le \alpha. \qquad (13)$$

Note again that the distribution of $I^{(\mathrm{rel})}(z_j)$ has to be estimated to derive the associated 
distribution for the extreme values, see, e.g., [DHF07], and also $I^{(\mathrm{rel})}(z)$ requires the 
estimation of $p(z)$ or $p^{\mathrm{anom}}(z)$. Therefore, a quantification of the epistemic uncertainty 
is essential for a proper outlier detection. Given the already mentioned problems of 
density estimation in high dimension, epistemic uncertainties may play a major role, 
unless a massive amount of data is available. 

For the case of novelty detection taking place at inference, a comparison of the 
information of a single instance $I^{(\mathrm{rel})}(z)$ with the usual distribution of information $P_I$ 
seems to be in order, which leads to the novelty criterion for level of significance 
$0 < \alpha < 1$ 

$$P_I\left(I(z_j) > I(z)\right) \le \alpha \quad \text{or} \quad P_{I^{\mathrm{rel}}}\left(I^{\mathrm{rel}}(z_j) > I^{\mathrm{rel}}(z)\right) \le \alpha. \qquad (14)$$

As a variant to this criterion, $I^{(\mathrm{rel})}(z_j)$ could also be replaced by the extreme value 
statistics over the number of inferences, alleviating the problem of generating false 
novelties by multiple testing. What has been stated on the necessity to quantify the 
epistemic uncertainty for the case of outlier detection equally applies to novelty 
detection. 
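
The novelty criterion can be sketched with an empirical estimate of the distribution of information. The example below assumes that the information values of the training instances are already available, here computed exactly for synthetic standard normal data, and it deliberately ignores the epistemic uncertainty discussed above. The function name and the data are hypothetical.

```python
import numpy as np


def is_novelty(info_new, info_train, alpha=0.05):
    """Flag an instance if the empirical probability that a training
    instance has higher information does not exceed alpha."""
    info_train = np.asarray(info_train, dtype=float)
    exceed_prob = np.mean(info_train > info_new)
    return bool(exceed_prob <= alpha)


rng = np.random.default_rng(0)
train = rng.normal(size=1000)
# Information -log p(z) under a standard normal density.
info_train = 0.5 * train ** 2 + 0.5 * np.log(2.0 * np.pi)
info_far = 0.5 * 6.0 ** 2 + 0.5 * np.log(2.0 * np.pi)   # instance at z = 6
```

Replacing `info_train` by extreme value statistics over several inferences, as suggested in the text, would require an additional model for the maxima and is not covered by this sketch.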

While anomaly detection is generally seen as a sub-field of unsupervised learning, 
some specific effects occur in the case of novelty detection in supervised learning. 
During the phase of inference, the data $z = (y, x)$ contain an unobserved component 
$y \in \mathcal{S}$, which, e.g., represents the instance’s label in a classification problem for the 
classes contained in $\mathcal{S}$. Using the decomposition $p(z) = p(y, x) = p(y|x)p(x)$, one 
obtains the (relative) information from 

$$I(z) = I(y|x) + I(x), \quad \text{or} \quad I^{\mathrm{rel}}(z) = I(y|x) + I^{\mathrm{rel}}(x) - I^{\mathrm{anom}}(y|x), \qquad (15)$$

where $I(y|x) = -\log(p(y|x))$ and $I^{\mathrm{anom}}(y|x) = -\log(p^{\mathrm{anom}}(y|x))$ are the conditional 
information terms on the right-hand side. Often, for the data of the reference measure 
$p^{\mathrm{anom}}(z)$, the labels are not contained in $\mathcal{S}$. In this case, one uses a non-informative 
conditional distribution $p^{\mathrm{anom}}(y|x) = \frac{1}{|\mathcal{S}|}$. If this is done, the last term of (15) 
becomes a constant that can be integrated into a threshold parameter. 

The (relative) information cannot be computed without knowing $y$. Therefore, 
the conditional expectation is used as an unbiased estimate, yielding the expected 
information 

$$EI^{(\mathrm{rel})}(x) = \mathbb{E}_{y \sim p(y|x)}\left(I^{(\mathrm{rel})}(z)\right) = E(x) + I^{(\mathrm{rel})}(x) + b^{(\mathrm{rel})}, \qquad (16)$$

where $E(x) = \sum_{y \in \mathcal{S}} p(y|x) I(y|x)$ is the expected information, or entropy, of the 
conditional distribution $p(y|x)$ and $b^{(\mathrm{rel})}$ is zero for the information and equal to 
$-\log(|\mathcal{S}|)$ for the relative information with non-informative conditional distribution 
$p^{\mathrm{anom}}(y|x)$. Note that $E(x)$ is bounded by $\log(|\mathcal{S}|)$. Therefore, under normal 
circumstances, the term $I^{(\mathrm{rel})}(x)$ will outweigh $E(x)$ by far. However, in problems like 
semantic segmentation, each component of $x$ is assigned a label from $\mathcal{S}$. This implies 
solving $|\mathcal{Z}|$ classification problems, where $\mathcal{Z}$ denotes the pixel space of $x$, so that the 
maximum value of $E(x)$ is $|\mathcal{Z}| \log(|\mathcal{S}|)$. 
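
The growth of the entropy term with the number of pixels can be made concrete with a short sketch. It assumes that the conditional label distribution factorizes over pixels, so that the entropy of an image is the sum of the pixel-wise softmax entropies; the array shapes and names are illustrative.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def expected_conditional_information(logits):
    """Entropy of p(y|x), summed over all pixels. logits: (H, W, S)."""
    p = softmax(logits)
    pixel_entropy = -np.sum(p * np.log(p + 1e-12), axis=-1)
    return pixel_entropy.sum()


H, W, S = 4, 5, 19                       # toy resolution, 19 classes
uniform_logits = np.zeros((H, W, S))     # uniform prediction at every pixel
random_logits = np.random.default_rng(0).normal(size=(H, W, S))
upper_bound = H * W * np.log(S)          # number of pixels times log of class count
```

A uniform prediction at every pixel attains the upper bound, which scales linearly with the image resolution.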

Therefore, the first term in (16) contains significant contributions, especially in 
situations where $|\mathcal{Z}|$ is large. The second term, $I^{(\mathrm{rel})}(x)$, loses importance under the 
hypothesis that the probability of the inputs $x$ does not vary greatly. Although this 
hypothesis could be supported by fair sampling strategies, it requires further critical 
evaluation. But at least to a significant part, the expected information as an anomaly 
measure with regard to instance $x$ is given by a dispersion measure, namely the 
entropy of the conditional probability. As the entropy can be well estimated using 
a supervised machine learning approach to estimate $p(y|x)$ from the data $\{z_j\}_{j \in \mathcal{J}}$, 
this part of the information is well accessible in contrast to $I^{(\mathrm{rel})}(x)$, which requires 
density estimation in high dimension. 

Lastly in this section, let us remark on the role of anomaly data $\{z_j^{\mathrm{anom}}\}_{j \in \mathcal{J}'} = 
\{x_j^{\mathrm{anom}}\}_{j \in \mathcal{J}'}$. If such data is available, it is desirable to train the machine learning 
model $p(y|x)$ to produce high values for $E(x_j^{\mathrm{anom}})$ so that the tractable part of the 
expected information $EI(x)$ shows good separation properties. This requirement 
can be incorporated into the loss function, as has been proposed in [HAB19, HMKS19] 
for classification. In fact, as the entropy $E(x)$ is maximized by the uniform 
(non-informative) label distribution $p(y|x_j^{\mathrm{anom}}) = \frac{1}{|\mathcal{S}|}$, the aforementioned loss will favor 
this prediction on anomalous inputs $\{x_j^{\mathrm{anom}}\}_{j \in \mathcal{J}'}$. In Sect. 4 of this chapter, we will 
extend this approach to the computer vision task of semantic segmentation, after 
having reviewed related works based on deep learning in the following section. 

3 Related Works 


After the introduction to anomaly detection from a theoretical point of view, we now 
turn to anomaly detection in deep learning. In this section, we review research in the 
direction of detecting and learning unknown objects in semantic segmentation. 


3.1 Anomaly Detection in Semantic Segmentation 


An emerging body of work explores the detection of anomalous inputs on image 
data, where the task is more commonly referred to as anomaly or out-of-distribution 
(OoD) detection. Anomaly detection was first tackled in the context of image 
classification by introducing post-processing techniques applied to softmax probabilities 
to adjust the confidence values produced by a classification model [HG17, LLLS18, 
LLS18, HAB19, MH20]. These methods have proven to successfully lower 
confidence scores for anomalous inputs at image level, which is why they were also 
adapted to anomaly detection in semantic segmentation [ACS19, BSN+19], i.e., 
to anomaly segmentation, by treating each single pixel in an image as a potential 
anomaly. Although those methods represent good baselines, they usually do not 
generalize well to segmentation, e.g., due to the high prediction uncertainties at object 
boundaries. The latter problem can, however, be mitigated by using segment-wise 
prediction quality estimates [RCH+20], an approach which has also been shown 
to indicate anomalous regions within an image [ORF20]. 

Recent works have proposed more dedicated solutions to anomaly segmentation. 
Among the resulting methods, many originate from uncertainty quantification. The 
intuition is that anomalous regions in an image correlate with high uncertainty. In this 
regard, early approaches estimate uncertainty using Bayesian deep learning, treating 
model parameters as distributions instead of point estimates [Mac92, Nea96]. Due 
to the computational complexity, approximations are mostly preferred in practice, 
which comprise, e.g., Monte-Carlo dropout [GG16], stochastic batch normaliza- 
tion [AAM+19], or an ensemble of neural networks [LPB17, GDS20]; with some 
of them also being extended to semantic segmentation in [BKC17, KG17, MG19]. 
Even when using approximations, Bayesian models still tend to be computationally 
expensive. Thus, they are not well suited to real time semantic segmentation which 
is required for safe automated driving. 

This is why tackling anomaly segmentation with non-Bayesian methods is more 
favorable from a practitioner’s point of view. Some approaches therefore include 
tuning a previously trained model to the task of anomaly detection, by either 
modifying its architecture or exploiting additional data. In [DT18], anomaly scores are 
learned by adding a separate branch to the neural network. In [HMD19, MH20], the 
network architecture is not changed, but auxiliary outlier data, which is disjoint from 
the actual training data, is introduced into the training process to learn anomalies. The 
latter idea motivated several works in anomaly segmentation [BSN+19, BKOS19, 
JRF20, CRG21]. Nonetheless, such models have to cope with multiple tasks, hence 
possibly leading to a performance loss with respect to the original semantic 
segmentation task [VGV+21]. Moreover, when including outlier datasets in the training 
process, it cannot be guaranteed that the chosen outlier data is a good proxy for all 
possible anomalies. 

Another recent line of works performs anomaly segmentation via generative mod- 
els that reconstruct original input images. These methods assume that reconstructed 
images will better preserve the appearance of known image regions than that of 
unknown ones. Anomalous regions are then identified by means of pixel-wise 
discrepancies between the original and reconstructed image. Thus, such an approach 
is specifically designed for anomaly segmentation and has been extensively studied 
in [CM15, MVD17, LNFS19, XZL+20, LHFS21, BBSC21]. The main benefit of 
these approaches is that they do not require any OoD training data, allowing them 
to generalize to all possible anomalous objects. However, all these methods are lim- 
ited by the integrated discrepancy module, i.e., the module that identifies relevant 
differences between the original and reconstructed image. In complex scenes, such 
as street scene images for automated driving, this might be a challenging task due to 
the open world setting. 

Regarding the dataset landscape, only a few anomaly segmentation datasets exist. 
The LostAndFound dataset [PRG+16] is a prominent example which contains 
anomalous objects in various streets in Germany while sharing the same setup as 
Cityscapes [COR+16]. LostAndFound, however, considers children and bicycles as 
anomalies, even though they are part of the Cityscapes training set. This was 
filtered and refined in Fishyscapes [BSN+19]. Another anomaly segmentation dataset 
accompanies the CAOS benchmark [HBM+20], which considers three object classes 
from BDD100k [YCW+20] as anomalies. Both Fishyscapes and CAOS try to 
mitigate low diversity by complementing their real images with synthetic data. 

Efforts to provide anomalies in real images have been made in [LNFS19] by 
sourcing and annotating street scene images from the web and in [LHFS21, SKGK20] by 
capturing and annotating images with small objects placed on the road. Just recently, 
the datasets published alongside the SegmentMeIfYouCan benchmark [CLU+21] 
build upon those works, particularly contributing to a broad diversity of anomalous 
street scenes as well as objects. 


3.2 Incremental Learning in Semantic Segmentation 


Building upon the detection of anomalies, training data can be enriched in order to 
learn novel classes. To avoid training from scratch, several approaches tackle the task 
of incremental or even continuous learning, which can be understood as adapting 
to continuously evolving environments. Besides learning novel classes, incremental 
learning also encompasses adapting to alternative tasks or other domains. A com- 
prehensive framework to compare these different learning scenarios is provided in 
[vdVT19]. 

When learning novel classes, the primary issue incremental learning approaches 
face is a loss of the original performance on previously learned classes, which is 
commonly known as catastrophic forgetting [MC89]. To overcome this problem, a model 
needs to be both “stable” and “plastic”, i.e., the model needs to retain its original 
knowledge while being able to adapt to new environments. The difficulty of 
meeting these requirements at the same time is called the stability-plasticity dilemma 
[AR05]. In this regard, proposed solution strategies can be separated into three 
categories, which are either based on architecture, regularization, or rehearsal. Most of 
these methods have been applied to image classification first. 

Architecture strategies employ separate models for each sequential incremental 
learning task, combined with a selector to determine which model will be used for 
inference [PUUH01, CSMDB19, ARC20]. However, these approaches suffer from 
data imbalances, so that standard classification algorithms tend to favor the 
majority class. Approaches to mitigate skewed data distributions are usually based on 
over- or undersampling. Another line of works, such as [RRD+16, RPR20], employs 
“growing” models, i.e., enlarging the model capacity by increasing the number of 
model parameters for more complex tasks. In [ACT17], the authors propose an 
automated approach to select the proper task-specific model at test time. More efficient 
approaches were introduced in [GK16, YYLH18], which restrict the adaptation of 
parameters to relevant parts of the model in terms of the new task. The Self-Net 
[MCE20] is made up of an autoencoder that learns low-dimensional representations 
of the models belonging to previously learned tasks. Retaining existing 
knowledge by approximating the old weights instead of saving them directly thereby 
comes with an implicit storage compression. The incremental adaptive deep model 
developed in [YZZ+19] enables capacity scalability and sustainability by exploiting 
the fast convergence of shallow models at the initial stage and afterwards utilizing 
the power of deep representations gradually. Other procedures perform continuous 
learning, e.g., using a random forest [HCP+19], an incrementally growing DNN 
retaining a basic backbone [SAR20], or nerve pruning and synapse consolidation 
[PTJ+21]. 

Regularization strategies can be further distinguished between weight regulariza- 
tion, i.e., measuring the importance of weights, and distillation, i.e., transferring a 
model’s knowledge into another. The former identifies parameters with great impact 
on the original tasks, whose updates are then suppressed. Elastic weight consolidation 
(EWC) [KPR+17] is one representative method, evaluating weight importance 
based on the Fisher information matrix, while the synaptic intelligence (SI) method 
[ZPG17] calculates the cumulative change of Euclidean distance after retraining 
the model. Both regularization methods were further enhanced, e.g., by combining 
them [CDAT18, AM19], or by including unlabeled data [ABE+18]. Another idea to 
maintain model stability was adapted in [ZCCY21, FAML19], updating gradients 
based on orthogonal constraints. Bayesian neural networks are applied in [LKJ+17] 
to approximate a Gaussian distribution of the parameters from a single to a combined 
task. 

Distillation is a regularization method, where the knowledge of an old model 
can be transferred into a new model to partly overcome catastrophic forgetting. 
Knowledge distillation, proposed in [HVD14], was originally invented to transfer 
knowledge from a complex into a simple model. The earliest approach which applies 
knowledge distillation to incremental learning is called learning without forgetting 
(LwF) [LH18]. A combination of knowledge distillation and EWC was proposed in 
[SCL+18]. Further approaches based on distillation loss are, e.g., [JJJK18, YHW+19, 
KBJC19, LLSL19]. 

Rehearsal or pseudo-rehearsal-based methods, which were already proposed in 
[Rob95], mitigate catastrophic forgetting by allowing the model to review old knowl- 
edge whenever it learns new tasks. While rehearsal-based methods retain a subset of 
the old training data, pseudo-rehearsal strategies construct a generator during retrain- 
ing, which learns to produce pseudo-data as similar to the old training data as possible. 
Hence, they provide the advantages of rehearsal even if the previously learned infor- 
mation is unavailable. Methods reusing old data are, e.g., incremental classifier and 
representation learning (iCaRL) [RKSL17], which simultaneously learns classifiers 
and feature representation, or the method presented in [CMJM+18], which proposes a 
representative memory. The bias correction (BiC) method [WCW+19] keeps old data 
in a similar manner, but handles the data imbalance differently. Most pseudo-rehearsal 
approaches include generative adversarial networks (GANs) [OOS17, WCW+18, 
MSC+19, OPK+19] or a variational autoencoder (VAE) [SLKK17]. The method 
presented in [HPL+18] combines distillation and retrospective (DR), whereby base- 
line approaches such as LwF are outperformed by a large margin. 

Only a few works exist, such as [KBDFs20, MZ21, TTA19], that adapt incremental 
learning techniques to semantic segmentation. They adjust knowledge distillation, 
using no or only a small portion of old data, respectively. One challenge of continuous 
learning for semantic segmentation is that images may contain unseen as well as 
known classes. Hence, annotations that are restricted to some task assign a large 
number of pixels to a background class, exhibiting a semantic distribution shift. 
The authors of [CMB+20] provide a framework that mitigates biased predictions 
towards this background class. While existing approaches require supervision, we 
employ incremental learning in a semi-supervised fashion, as we do not have access 
to any ground truth including novelties. 


4 Anomaly Segmentation 


The task of anomaly detection in the context of semantic segmentation, i.e., identi- 
fying anomalies at pixel-level, is commonly known as anomaly segmentation. For 
this task several approaches have been proposed that are either based on uncer- 
tainty quantification, generative models, or training strategies specifically tailored to 
anomaly detection. In this chapter, we will first review some of those well-established 
methods and, subsequently, report a performance comparison with respect to their 
capability of identifying anomalies. In particular, we will demonstrate empirically 
that entropy maximization yields great performance on this segmentation task, which 
is in accordance with the statement of the entropy’s importance from the information-
based perspective as presented in Sect. 2. 


4.1 Methods 


Let $x \in \mathbb{I}^{H \times W \times 3}$, $\mathbb{I} = [0, 1]$, denote (normalized) color images of resolution $H \times W$. 
Feeding those images to a semantic segmentation network $F : \mathbb{I}^{H \times W \times 3} \to \mathbb{R}^{H \times W \times S}$, 
the model produces pixel-wise class scores $y = (y_{i,s})_{i \in \mathcal{Z}, s \in \mathcal{S}} = F(x) \in \mathbb{R}^{H \times W \times S}$, 
with the set of pixel locations denoted by $\mathcal{Z} = \{1, \ldots, H\} \times \{1, \ldots, W\}$ and the 
set of trained (hence known) classes denoted by $\mathcal{S} = \{1, \ldots, S\}$. The corresponding 
predicted segmentation mask is given by $m = (m_i)_{i \in \mathcal{Z}} \in \{1, \ldots, S\}^{H \times W}$, where 
the maximum a-posteriori probability principle is applied, i.e., $m_i = \arg\max_{s \in \mathcal{S}} y_{i,s}$ 
for all $i \in \mathcal{Z}$. Regarding the task of anomaly segmentation, the ultimate goal is then to 
obtain a score map $a = (a_i)_{i \in \mathcal{Z}} \in \mathbb{R}^{H \times W}$ that indicates the presence of an anomaly 
at each pixel location $i \in \mathcal{Z}$ within image $x$, i.e., the higher the score, the more likely 
there should be an anomaly. 

Each of the methods employed in this section provides such score maps. Their 
underlying segmentation networks (DeepLabV3+, [CZP+18]) are all trained on 
Cityscapes [COR+16], i.e., objects not included in the set of Cityscapes object classes 
are considered as anomalies since they have not been seen during training and thus 
are unknown. The anomaly detection methods, however, differ in the way how the 
scores are obtained, which is why we briefly introduce the different techniques in 
the following. 


Maximum softmax probability: The most commonly used baseline for anomaly 
detection at image level is thresholding at the maximum softmax probability (MSP) 
[HG17]. This method assumes that anomalies are assigned a low confidence 
or, equivalently, high uncertainty. Using MSP in anomaly segmentation, the score 
map is computed via 

$$a_i = 1 - \max_{s \in \mathcal{S}} \mathrm{softmax}_s(y_i) \quad \forall\, i \in \mathcal{Z}. \qquad (17)$$
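
A minimal sketch of the MSP score map is given below. It assumes that the class scores of one image are available as a NumPy array of shape (H, W, S); the toy logits are hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def msp_anomaly_score(logits):
    """One minus the maximum softmax probability per pixel.

    logits: (H, W, S) class scores -> (H, W) anomaly score map
    """
    return 1.0 - softmax(logits).max(axis=-1)


logits = np.zeros((2, 2, 4))      # uniform prediction everywhere ...
logits[0, 0, 1] = 50.0            # ... except one very confident pixel
score_map = msp_anomaly_score(logits)
```

The confident pixel receives a score close to zero, while the uniformly predicted pixels receive the maximal score $1 - 1/S$.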


ODIN: A simple extension to improve MSP is applying temperature scaling as 
well as adding perturbations, which is known as out-of-distribution detector for 
neural networks (ODIN) [LLS18]. In more detail, let $t \in \mathbb{R}_{>0}$ be a hyperparameter 
for temperature scaling and let $\varepsilon \in \mathbb{R}_{\ge 0}$ be a hyperparameter for the perturbation 
magnitude. Then, the input $x$ is modified as 

$$\tilde{x} = (\tilde{x}_i)_{i \in \mathcal{Z}} \quad \text{with} \quad \tilde{x}_i = x_i - \varepsilon\, \mathrm{sign}\left(-\frac{\partial}{\partial x_i} \log \max_{s \in \mathcal{S}} \mathrm{softmax}_s\left(\frac{y_i}{t}\right)\right) \quad \forall\, i \in \mathcal{Z}, \qquad (18)$$

yielding the prediction $\tilde{y} = F(\tilde{x})$ for which thresholding is applied at the MSP, i.e., the 
anomaly score map is given by 

$$a_i = 1 - \max_{s \in \mathcal{S}} \mathrm{softmax}_s(\tilde{y}_i) \quad \forall\, i \in \mathcal{Z}. \qquad (19)$$

Mahalanobis distance: This anomaly detection approach estimates how well latent 
features fit to those observed in the training data. Let $(L - 1)$ denote the penultimate 
layer of a network $F$ with $L$ layers. In [LLLS18], the authors have shown that training a 
softmax classifier fits a class-conditional Gaussian distribution for the output features 
$f_{L-1}$. Hence, under that assumption 

$$P\left(y_i^{(L-1)} \,\middle|\, \hat{y}_i = s\right) = \mathcal{N}\left(y_i^{(L-1)} \,\middle|\, \mu_s, \Sigma_s\right) \quad \forall\, i \in \mathcal{Z}, \qquad (20)$$

where $y^{(L-1)} = f_{L-1}(x) \in \mathbb{R}^{H \times W \times C_{L-1}}$ denotes the feature map of the penultimate 
layer given input $x$, and $\hat{y}$ the corresponding one-hot encoded final target. The minimal 
Mahalanobis distance $d_M(x, \mu_s)$ is then an obvious choice for an anomaly score map 

$$a_i = \min_{s \in \mathcal{S}} d_M(x, \mu_s) = \min_{s \in \mathcal{S}} \left(y_i^{(L-1)} - \mu_s\right)^{\top} \Sigma_s^{-1} \left(y_i^{(L-1)} - \mu_s\right) \quad \forall\, i \in \mathcal{Z}, \qquad (21)$$

cf. (2). Note that the class means $\mu_s \in \mathbb{R}^{C_{L-1}}$ and class covariances $\Sigma_s \in \mathbb{R}^{C_{L-1} \times C_{L-1}}$ 
are generally unknown, but can be estimated by means of the training dataset. 
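
The following sketch computes a minimal class-wise Mahalanobis score per pixel, assuming the squared-distance form and assuming that penultimate-layer features as well as class means and covariances have already been estimated. The array shapes and toy values are hypothetical.

```python
import numpy as np


def mahalanobis_score(features, means, covariances):
    """Minimal class-wise (squared) Mahalanobis distance per pixel.

    features: (H, W, C), means: (S, C), covariances: (S, C, C) -> (H, W)
    """
    precisions = np.linalg.inv(covariances)                     # (S, C, C)
    diff = features[:, :, None, :] - means[None, None, :, :]    # (H, W, S, C)
    dist = np.einsum("hwsc,scd,hwsd->hws", diff, precisions, diff)
    return dist.min(axis=-1)


means = np.array([[0.0, 0.0], [5.0, 5.0]])           # two classes, C = 2
covariances = np.stack([np.eye(2), np.eye(2)])
features = np.array([[[0.0, 0.0], [5.0, 4.0]]])      # image with 1 x 2 pixels
score_map = mahalanobis_score(features, means, covariances)
```

A pixel whose feature vector coincides with a class mean obtains the score zero, and the score grows with the distance to the closest class.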


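Monte-Carlo dropout: In semantic segmentation, Monte-Carlo dropout represents 
the most prominent technique to approximate Bayesian neural networks. According 
to [MG19], (epistemic) uncertainty is measured as the mutual information, which 
might serve as anomaly score map, i.e., 

$$a_i = -\sum_{s \in \mathcal{S}} \left(\frac{1}{R} \sum_{r \in \mathcal{R}} p_{i,s}^{(r)}\right) \log\left(\frac{1}{R} \sum_{r \in \mathcal{R}} p_{i,s}^{(r)}\right) + \frac{1}{R} \sum_{s \in \mathcal{S}} \sum_{r \in \mathcal{R}} p_{i,s}^{(r)} \log\left(p_{i,s}^{(r)}\right) \quad \forall\, i \in \mathcal{Z}, \qquad (22)$$

with $p_i^{(r)} = (p_{i,s}^{(r)})_{s \in \mathcal{S}} = \mathrm{softmax}(y_i^{(r)})$ in the sampling round $r \in \mathcal{R} = \{1, \ldots, R\}$. 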
Typically, 8 < R < 12. 
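
A minimal sketch of the mutual information score is shown below. It assumes that the softmax outputs of the stochastic forward passes are stacked into one array of shape (R, H, W, S); the small constant `eps` is an implementation detail added here to avoid taking the logarithm of zero.

```python
import numpy as np


def mutual_information(probs, eps=1e-12):
    """Mutual information per pixel from R stochastic forward passes.

    probs: (R, H, W, S) softmax outputs -> (H, W)
    """
    mean_p = probs.mean(axis=0)
    # Entropy of the averaged prediction.
    entropy_of_mean = -np.sum(mean_p * np.log(mean_p + eps), axis=-1)
    # Average over rounds of sum_s p log p (the negative entropy per round).
    mean_neg_entropy = np.mean(np.sum(probs * np.log(probs + eps), axis=-1), axis=0)
    return entropy_of_mean + mean_neg_entropy


# Eight identical predictions: the sampling rounds agree.
agree = np.tile(np.array([0.7, 0.3]).reshape(1, 1, 1, 2), (8, 1, 1, 1))
# Two rounds that confidently predict different classes.
disagree = np.array([[1.0, 0.0], [0.0, 1.0]]).reshape(2, 1, 1, 2)
```

When all rounds agree, the score vanishes even if the prediction itself is uncertain; confident but contradictory rounds yield the maximal value $\log 2$ for two classes.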


Void classifier: Neural networks can be trained to output confidences for the presence 
of anomalies [DT18]. One approach in this context is adding an extra class to the 
set S of previously trained classes of a semantic segmentation network, which then 
also requires annotated anomaly data to learn from. To this end, the void class in 
Cityscapes is a popular choice as proxy for all possible anomaly data [BSN+19], in 
particular if the segmentation model was originally trained on Cityscapes. Thus, the 
softmax output of the additional class s = S + 1 represents the anomaly score map, 
i.e., 

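$$a_i = \mathrm{softmax}_{s = S+1}(y_i) \quad \forall\, i \in \mathcal{Z}, \qquad (23)$$

where $y = (y_i)_{i \in \mathcal{Z}} = (y_{i,s})_{i \in \mathcal{Z}, s \in \{1, \ldots, S+1\}} = F(x)$, $F : \mathbb{I}^{H \times W \times 3} \to \mathbb{R}^{H \times W \times (S+1)}$. 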

Learned embedding density: Let $f_\ell(x) \in \mathbb{R}^{H_\ell \times W_\ell \times C_\ell}$ denote the feature map, or 
equivalently feature embedding, at layer $\ell \in \mathcal{L} = \{1, \ldots, L\}$ of a semantic 
segmentation network. By employing normalizing flows, the true distribution of features 
$p(f_\ell(x)) \in \mathbb{I}^{H_\ell \times W_\ell}$, where $x \in \mathcal{X}^{\mathrm{train}}$ is drawn from the training dataset, can be learned 
via maximum likelihood, i.e., normalizing flows learn to produce the approximation 
$\hat{p}(f_\ell(x)) \approx p(f_\ell(x))$ [BSN+19]. At test time, the negative log-likelihood measures 
how well features of a test sample fit to the feature distribution observed in the 
training data, yielding the anomaly score map 

$$a = \mathrm{up}^{\mathrm{lin}}\left(-\log \hat{p}(f_\ell(x))\right) \quad \text{($\log$ applied element-wise)} \qquad (24)$$

with $\mathrm{up}^{\mathrm{lin}} : \mathbb{R}^{H_\ell \times W_\ell} \to \mathbb{R}^{H \times W}$ denoting (bi-)linear upsampling. 


Image resynthesis: After obtaining the predicted segmentation mask 
$m = (m_i)_{i \in \mathcal{Z}} \in \{1, \ldots, S\}^{H \times W}$, this output can be further processed by a generative 
model $G : \{1, \ldots, S\}^{H \times W} \to \mathbb{I}^{H \times W \times 3}$ aiming to reconstruct the original input image, 
i.e., $x' = G(m) \approx x$. This process is also called image resynthesis, and the intuition 
is that the reconstruction quality for anomalous objects is worse than for objects on which 
the generative model has been trained. To determine pixel-wise anomalies, a discrepancy 
network [LNFS19] $D : \{1, \ldots, S\}^{H \times W} \times \mathbb{I}^{H \times W \times 3} \times \mathbb{I}^{H \times W \times 3} \to \mathbb{R}^{H \times W}$ can then be 
employed, which classifies whether a pixel is anomalous or not, based on the 
information provided by $m$, $x'$, and $x$. Here, $D$ is trained on intentionally triggered 
classification mistakes that are produced by flipping classes on predicted segmentation 
masks. The anomaly score map is given by the output of the discrepancy network, 
i.e., 

$$a = D(m, x', x) = D(m, G(m), x). \qquad (25)$$


SynBoost: The image resynthesis approach is limited by the employed discrepancy 
module $D$. In [BBSC21], the authors proposed to extend the discrepancy network 
by incorporating further inputs based on uncertainty, such as the pixel-wise softmax 
entropy 

$$H_i(x) = -\sum_{s \in \mathcal{S}} \mathrm{softmax}_s(y_i) \log(\mathrm{softmax}_s(y_i)) \quad \forall\, i \in \mathcal{Z}, \qquad (26)$$

and the pixel-wise softmax probability margin 

$$M_i(x) = 1 - \max_{s \in \mathcal{S}} \mathrm{softmax}_s(y_i) + \max_{s \in \mathcal{S} \setminus \{m_i\}} \mathrm{softmax}_s(y_i) \quad \forall\, i \in \mathcal{Z}. \qquad (27)$$

Furthermore, $D$ is trained on anomaly data provided by the Cityscapes void class. 
Thus, the anomaly score map is given by 

$$a = D(m, x', x, H(x), M(x)) = D(m, G(m), x, H(x), M(x)), \qquad (28)$$

with $H(x) = (H_i(x))_{i \in \mathcal{Z}}$ and $M(x) = (M_i(x))_{i \in \mathcal{Z}}$. 
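
The two uncertainty inputs can be computed directly from the softmax output. The sketch below assumes softmax probabilities of shape (H, W, S) and covers only these two maps, not the discrepancy network itself; the constant `eps` is added here to avoid the logarithm of zero.

```python
import numpy as np


def softmax_entropy(probs, eps=1e-12):
    """Pixel-wise softmax entropy; probs: (H, W, S) -> (H, W)."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)


def probability_margin(probs):
    """Pixel-wise margin: 1 - largest + second-largest softmax probability."""
    ordered = np.sort(probs, axis=-1)
    return 1.0 - ordered[..., -1] + ordered[..., -2]


# One uniform and one fully confident pixel, S = 4 classes.
probs = np.array([[[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]])
entropy_map = softmax_entropy(probs)
margin_map = probability_margin(probs)
```

Both maps are large for the uniform pixel and vanish for the confident one, so they carry similar but not identical uncertainty information.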

Entropy maximization: A desirable property of semantic segmentation networks is 
that they attach high prediction uncertainty to novel objects. To this end, the softmax 
entropy, see (26), is one intuitive uncertainty measure. The segmentation network 
can be trained for high entropy on anomalous inputs via the multi-criteria training 
objective [CRG21] 

$$J^{\mathrm{total}} = (1 - \lambda)\, \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[J^{\mathrm{CE}}(F(x), y)\right] + \lambda\, \mathbb{E}_{x \sim \mathcal{D}^{\mathrm{anom}}}\left[J^{\mathrm{anom}}(F(x))\right], \qquad (29)$$

where $\mathcal{D}$ denotes non-anomaly training data (labels available) and $\mathcal{D}^{\mathrm{anom}}$ denotes 
anomaly training data (no labels available). In this approach, the COCO dataset 
[LMB+14] represents a set of so-called known unknowns, which is used as proxy 
for $\mathcal{D}^{\mathrm{anom}}$ with the aim to represent all possible anomaly data. Moreover, $\lambda \in \mathbb{I}$ is a 
hyperparameter controlling the impact of each single loss function on the overall loss 
$J^{\mathrm{total}}$. For non-anomaly data, the loss function is chosen to be the commonly used 
cross-entropy $J^{\mathrm{CE}}$, while for anomaly data, i.e., for known unknowns, we have 

$$J^{\mathrm{anom}}(F(x)) = -\frac{1}{|\mathcal{Z}|} \sum_{i \in \mathcal{Z}} \frac{1}{S} \sum_{s \in \mathcal{S}} \log \mathrm{softmax}_s(y_i), \quad x \sim \mathcal{D}^{\mathrm{anom}}. \qquad (30)$$

Minimizing $J^{\mathrm{anom}}$ is therefore equivalent to maximizing the softmax entropy, since 
both reach their optimum when the softmax probabilities are uniformly distributed, 
i.e., $\mathrm{softmax}_s(y_i) = \frac{1}{S}$ for all $s \in \mathcal{S}$, $i \in \mathcal{Z}$. After training, the anomaly score map is then 
given by the (normalized) softmax entropy 

$$a_i = \frac{1}{\log S} H_i(x) = -\frac{1}{\log S} \sum_{s \in \mathcal{S}} \mathrm{softmax}_s(y_i) \log(\mathrm{softmax}_s(y_i)) \quad \forall\, i \in \mathcal{Z}. \qquad (31)$$

From an information-based point of view, the entropy contributes significantly 
to the expected information. This particularly applies to instance predictions in 
semantic segmentation, which motivates the entropy maximization approach for the 
detection of unknown objects, cf. Sect. 2. 
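
The training objective and the resulting score can be sketched in a few lines. The example assumes logits of shape (H, W, S) for one in-distribution and one anomaly image and averages the anomaly loss over pixels and classes; function names and the default weight `lam` are illustrative, and no actual training loop is shown.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy_loss(logits, labels):
    """Pixel-averaged cross-entropy; labels: (H, W) integer class indices."""
    picked = np.take_along_axis(log_softmax(logits), labels[..., None], axis=-1)
    return -picked[..., 0].mean()


def anomaly_loss(logits):
    """Cross-entropy against the uniform label distribution on anomaly data."""
    return -log_softmax(logits).mean()


def total_loss(logits_in, labels_in, logits_anom, lam=0.1):
    """Single-sample estimate of the multi-criteria objective, lam in [0, 1]."""
    return ((1.0 - lam) * cross_entropy_loss(logits_in, labels_in)
            + lam * anomaly_loss(logits_anom))


def normalized_entropy_score(logits):
    """Softmax entropy divided by log S, so that scores lie in [0, 1]."""
    logp = log_softmax(logits)
    return -(np.exp(logp) * logp).sum(axis=-1) / np.log(logits.shape[-1])


uniform = np.zeros((2, 3, 5))            # uniform prediction, S = 5
peaked = np.zeros((2, 3, 5))
peaked[..., 0] = 10.0                    # confident prediction of class 0
labels = np.zeros((2, 3), dtype=int)
```

The anomaly loss attains its minimum $\log S$ for uniform predictions, which are exactly the predictions with maximal normalized entropy.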


4.2 Evaluation and Comparison of Anomaly Segmentation Methods 


Discriminating between anomaly and non-anomaly is essentially a binary classifi- 
cation problem. In order to evaluate the pixel-wise anomaly detection capability, we 
use the receiver operating characteristic (ROC) curve as well as the precision recall 
(PR) curve. While for the ROC curve the true positive rate is plotted against the false 
positive rate at varying thresholds, in the PR curve precision is plotted against recall 
at varying thresholds. Note that we consider anomalies as the positive class, i.e., 
correctly identified anomaly pixels are considered as true positives. In both curves, 
the degree of separability is then measured by the area under the curve (AUC), where 
better separability corresponds to a higher AUC. 

The main difference between these two performance metrics is how they cope with 
class imbalance. While the ROC curve incorporates the number of true negatives (for 
the computation of the false positive rate), in PR curves true negatives are ignored and, 
consequently, more emphasis is put on finding the positive class. With the anomaly 
score maps as defined in Sect. 4.1, in our case, finding the positive class corresponds 
to identifying anomalies. 
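
Both metrics can be sketched with plain NumPy. The snippet below is a minimal illustration on four hypothetical pixel scores, where the area under the PR curve is estimated by the average precision and tied scores are assumed not to occur in that estimate. The pairwise AUROC computation is quadratic in the number of pixels, so a rank-based implementation would be preferable for full images.

```python
import numpy as np


def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


def average_precision(scores, labels):
    """Average precision as an estimate of the area under the PR curve."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores)              # descending anomaly score
    hits = labels[order]
    precision = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return precision[hits].mean()


scores = [0.9, 0.8, 0.7, 0.1]
labels = [1, 0, 1, 0]    # 1 = anomaly pixel (positive class)
```

For this toy example, three of the four positive-negative pairs are ranked correctly, and the precision at the two anomaly pixels is 1 and 2/3, respectively.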

As evaluation datasets, we use LostAndFoundNoKnown [BSN+19] and RoadObstacle21 
[CLU+21], which are both part of the public SegmentMeIfYouCan anomaly 
segmentation benchmark.¹ LostAndFoundNoKnown consists of 1043 road scene 
images where obstacles are placed on the road. This dataset is a subset of the 
prominent LostAndFound dataset [PRG+16] but considers only obstacles from object 
classes which are disjoint from those in the Cityscapes labels [COR+16]. More precisely, 
images with humans and bicycles are removed such that the remaining obstacles in 
the dataset also represent anomalies to models trained on Cityscapes. Similar scenes 
can be found in RoadObstacle21. That dataset was published alongside the 
SegmentMeIfYouCan benchmark and contains 327 road obstacle scene images with diverse 
road surfaces as well as diverse types of anomalous objects. Both datasets restrict the 
region of interest to the road where anomalies appear. This task is extremely 
safety-critical as it is mandatory in automated driving to make sure that the drivable area is 
free of any hazard. All anomaly segmentation methods introduced in the preceding 
Sect. 4.1 are suited to be evaluated on these datasets. We provide a visual comparison 
of anomaly scores produced by the tested methods in Fig. 1. We report numerical 
results in Fig. 2 and in the corresponding Table 1. 

In general, we observe that anomaly detection methods originally designed for 
image classification, including MSP, ODIN and Mahalanobis, do not generalize well 
to anomaly segmentation. As the Mahalanobis distance is based on statistics of the 
Cityscapes dataset, the anomaly detection is likely to suffer from performance loss 
under domain shift. The same holds for Monte Carlo dropout and learned embedding 
density, which perform particularly poorly on RoadObstacle21, where various 
road surfaces are present. Therefore, those methods potentially act as domain shift 
classifiers rather than as detectors of unknown objects. 

The detection methods based on autoencoders, namely image resynthesis and 
SynBoost, prove to be better suited for the task of anomaly segmentation, being 
clearly superior to all the approaches discussed so far. Autoencoders 
are limited by their discrepancy module, and we observe that anomaly detection 
performance significantly benefits from incorporating uncertainty measures, as done 
by SynBoost. Only entropy maximization reaches similar anomaly segmentation 
performance, even outperforming SynBoost on RoadObstacle21. This again can be 
explained by the diversity of road surfaces, which detrimentally affects the 
discrepancy module. 


1 www.segmentmeifyoucan.com. 


Detecting and Learning the Unknown in Semantic Segmentation 293 


[Figure panels: Mahalanobis, MC dropout, Image resynthesis, SynBoost, Entropy max.] 

Fig. 1 Qualitative comparison of anomaly score maps for one example image of RoadAnomaly21. 
Here, red indicates high anomaly scores while blue indicates low ones. The ground truth anomaly 
instance is highlighted by green contours. Note that the region of interest is restricted to the road, 
highlighted by red contours in the annotation 


As a final remark, we draw attention to the use of anomaly data. The void 
classifier follows the same intuition as entropy maximization by including known 
unknowns, but does not come close to its anomaly segmentation performance. We 
conclude that the COCO dataset is better suited as a proxy for anomalous objects than 
the Cityscapes unlabeled objects. Moreover, the results of that method empirically 
demonstrate the impact of the entropy in anomaly segmentation, which is in 
accordance with the importance of the entropy from the information-theoretic perspective 
described in Sect. 2. 


4.3 Combining Entropy Maximization and Meta Classification 


Meta classification is the task of discriminating between a true positive prediction 
and a false positive prediction. For semantic segmentation, this idea was originally 
proposed in [RCH+20]. By means of hand-crafted metrics, which are based on dispersion 
measures, geometry features, or location information, all derived from softmax 
probabilities, meta classifiers have been shown to reliably identify incorrect predictions 
at segment level. More precisely, connected components of pixels sharing the same 
class label are considered as segments in this context, and a false positive segment 
then corresponds to a segment-wise intersection-over-union (IoU) of 0. 

Fig. 2 Receiver operating characteristic (left column) and precision recall (right column) curves 
for LostAndFoundNoKnown (top row) and RoadObstacle21 (bottom row), respectively. Dashed 
red lines indicate the performance of random guessing, i.e., the “no-skill” baseline. The degree of 
separability between anomaly and non-anomaly is measured by the area under the curve 

Table 1 Pixel-wise anomaly detection performance on the datasets LostAndFoundNoKnown and 
RoadObstacle21, respectively. The main evaluation metric represents the area under precision-recall 
curve (AuPRC). Moreover, the area under receiver operating characteristic (AuROC) and the false 
positive rate at a true positive rate of 95% (FPR95TPR) are reported for further insights 

                        LostAndFoundNoKnown               RoadObstacle21
Method                  AuPRC ↑  AuROC ↑  FPR95TPR ↓      AuPRC ↑  AuROC ↑  FPR95TPR ↓
Maximum Softmax         30.1     93.0     33.2            10.0     95.5     17.9
ODIN                    52.9     95.1     30.0            11.9     96.0     16.4
Mahalanobis             55.0     97.5     12.9            19.5     95.1     21.7
Monte Carlo Dropout     36.8     92.2     35.5             4.9     83.5     50.3
Void classifier          4.8     79.5     47.0            10.4     89.7     41.5
Embedding density       61.7     98.0     10.4             0.8     81.0     46.4
Image resynthesis       42.7     96.4     17.4            37.5     98.6      4.7
SynBoost                81.7     98.3      4.6            71.3     99.4      3.2
Entropy maximization    77.9     98.0      9.7            76.0     99.7      1.3

The meta classification approach can straightforwardly be adapted to post-process 
anomaly segmentation masks. This seems particularly reasonable in combination 
with entropy maximization. Since entropy maximization generally increases the 
sensitivity towards predicting anomalies, it is possible that the entropy is also increased 
at pixels belonging to non-anomalous objects. In the latter case, this would yield 
false positive anomaly instance predictions, which, however, can be identified and 
discarded afterwards by meta classification. The concept of trading false-positive 
detection for anomaly detection performance is motivated by [CRH+20]. Moreover, 
meta classifiers are expected to considerably benefit from entropy maximization, 
since in the original work [RCH+20] the entropy as metric has already been observed 
to be well correlated with the segment-wise IoU. 
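
As the softmax entropy is a central metric in this context, the following minimal sketch shows how a pixel-wise entropy and its average over one segment could be computed. The function names, the flat pixel indexing, and the normalization to [0, 1] are assumptions for illustration and are not taken from [RCH+20].

```python
import math


def softmax(logits):
    """Numerically stable softmax over the class scores of one pixel."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy of a class distribution, scaled to [0, 1] (assumed normalization)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def segment_mean_entropy(pixel_logits, segment):
    """Average entropy over the pixel indices that form one segment."""
    values = [normalized_entropy(softmax(pixel_logits[i])) for i in segment]
    return sum(values) / len(values)


# Pixel 0 is maximally uncertain, pixel 1 is confidently assigned to class 0.
pixel_logits = [[0.0, 0.0, 0.0], [20.0, 0.0, 0.0]]
print(segment_mean_entropy(pixel_logits, segment=[0, 1]))
```

A segment-wise average of this kind is one example of a dispersion measure that a meta classifier could take as input.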

In our experiments on LostAndFound [PRG+16], we employ a logistic regres- 
sion as meta classifier that is applied as a post-processing step on top of softmax 
probabilities. We observe that the meta classifier is capable of reliably removing 
false-positive anomaly instance predictions, which in turn significantly improves 
detection performance of anomalous objects. The meta classification performance is 
reported in Table 2; a visual example is given in Fig. 3. We note that meta classification 
is applied to segmentation masks as input. Therefore, the output of the combination 
of entropy maximization and meta classification does not yield pixel-wise anomaly 
scores to compare against the methods presented in Sect. 4.1. 

The idea of meta classification can even be used to directly identify potential 
anomalous objects in the semantic segmentation mask, see [ORG18], which will 
be subject to discussion in the following section about unsupervised learning of 
unknown objects. 

Fig. 3 Meta classification as quality rating of anomaly instance predictions. Before applying the 
meta classifier (top), the anomaly segmentation mask contains anomaly instance predictions (orange 
segments), with some false-positives on the road. Based on softmax probabilities, the meta classifier 
performs a prediction quality rating (bottom left, red corresponds to poor quality), which is then 
used to remove false positive anomaly instance predictions (bottom right). Note that the region of 
interest is restricted to the road, where ground truth anomalous objects (or obstacles) are indicated 
by green contours 


5 Discovering and Learning Novel Classes 


If certain types of anomalies appear frequently, it might be reasonable to include 
them as additional learnable classes of the segmentation model. In this section, we 
propose an unsupervised method to further process anomaly predictions, all with the 
goal to produce labels corresponding to novel classes. Afterwards, we will introduce 
an incremental learning approach to train a model on novel classes by means of the 
retrieved unsupervised labels. 


5.1 Unsupervised Identification and Segmentation of a Novel Class 


Consider the dataset $\mathcal{D}^{\mathrm{test}} \subseteq \mathcal{X}$ of unlabeled images $x = (x_i)_{i \in \mathcal{I}} \in \mathbb{I}^{H \times W \times 3}$ along 
with a semantic segmentation network $F : \mathbb{I}^{H \times W \times 3} \to \mathbb{R}^{H \times W \times S}$ trained on the set of 
classes $\mathcal{S} = \{1, \ldots, S\}$. Moreover, let $a = (a_i)_{i \in \mathcal{I}} \in \mathbb{R}^{H \times W}$ denote a score map, as 
introduced in Sect. 4, which assigns the degree of anomaly to each pixel $i \in \mathcal{I}$ in $x$. 
Our unsupervised anomaly segmentation technique is a three-step procedure: 


1. Image embedding: Image retrieval methods are commonly applied to construct a 
database of images that are visually related to a given image. On that account, such 
methods must quantify visual similarities, i.e., to measure the discrepancy or “dis- 
tance” between images. A simple idea is averaging over the pixel-wise differences. 
However, this approach is extremely sensitive towards data transformation such as 
rotation, variation in light, or different resolutions. More advanced approaches make 
use of visual descriptors that extract the elementary characteristics of the visual con- 
tents, e.g., color, shape, or texture. These methods are invariant to data transformation, 
i.e., they perform well in identifying images representing the same item. If we want 
to detect different instances of the same category, deep learning methods represent 
the state-of-the-art. In this regard, convolutional neural networks (CNNs) achieve 
very high accuracy in image classification tasks. These networks extract image 
features that are stable with respect to transformations as well as to the depicted 
object instance, i.e., objects of the same category result in similar feature vectors. We 
now adapt this idea to identify anomalies that belong to the same class. 

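Let $\mathcal{K}_{a|x}$ denote the set of connected components within $(a_i^{(t)})_{i \in \mathcal{I}}$, $a_i^{(t)} := \mathbb{1}_{\{a_i \geq t\}}$ 
$\forall i \in \mathcal{I}$ for a given threshold $t \in \mathbb{R}$, after processing image $x$. Furthermore, let $\mathcal{K} := 
\bigcup_{x \in \mathcal{D}^{\mathrm{test}}} \mathcal{K}_{a|x}$ denote the set of all predicted anomaly components in $\mathcal{D}^{\mathrm{test}}$. For each 
component $k \in \mathcal{K}_{a|x}$, we tailor the input $x$ to the image crop $x^{(k)} = (x_i)_{i \in \mathcal{I}^{(k)}}$, $\mathcal{I}^{(k)} \subseteq \mathcal{I}$, 
by means of the bounding box around $k \in \mathcal{K}_{a|x}$. By feeding the crop $x^{(k)}$ to an image 
classification network $G$, we map $x^{(k)}$ onto its feature vector $g^{(k)} := G_{L-1}(x^{(k)}) \in \mathbb{R}^n$, 
$n \in \mathbb{N}$ for all $k \in \mathcal{K}$. Here, $G_{L-1}$ denotes the output of the penultimate layer of $G$. 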
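
The component extraction and cropping described above can be sketched as follows. This is a simplified illustration with nested lists instead of image tensors; the function names, the use of 4-connectivity, and the convention that scores greater than or equal to the threshold count as anomalous are assumptions. The feature extraction by the network $G$ is omitted.

```python
from collections import deque


def anomaly_components(score_map, t):
    """Connected components (4-connectivity, assumed) of pixels with score >= t."""
    height, width = len(score_map), len(score_map[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if seen[r][c] or score_map[r][c] < t:
                continue
            # Breadth-first search collects all pixels connected to (r, c).
            queue, component = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and not seen[ny][nx] and score_map[ny][nx] >= t:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


def bounding_box_crop(image, component):
    """Crop the image to the bounding box around one component."""
    rows = [y for y, _ in component]
    cols = [x for _, x in component]
    top, bottom, left, right = min(rows), max(rows), min(cols), max(cols)
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

Each crop returned by `bounding_box_crop` would then be passed to the classification network to obtain the feature vector of the corresponding component.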


2. Dimensionality reduction: Feature vectors extracted by CNNs are usually very 
high-dimensional. This evokes several problems regarding the clustering of such 
data. The first issue is known as curse of dimensionality, i.e., the amount of required 
data explodes with increasing dimensionality. Furthermore, distance metrics become 
less precise. Dimensionality reduction approaches project the feature vectors onto a 
low-dimensional representation, either by feature elimination, selection, or extrac- 
tion. The latter creates new independent features as a combination of the original 
vectors and can be further distinguished between linear and non-linear techniques. 
A linear feature extraction approach, named principal component analysis (PCA) 
[Pea01], aims at decorrelating the components of the vectors by a change of basis, 
such that they are mostly aligned along the first axes. Thereby, not much informa- 
tion is lost if we drop the last components. A more recent non-linear method is 
t-distributed stochastic neighbor embedding (t-SNE) [vdMH08], which uses conditional 
probabilities representing pairwise similarities. Let us consider two feature 
vectors $g^{(k)}, g^{(k')}$ with $k, k' \in \mathcal{K}$ and let $p_{kk'} \in \mathbb{I}$ denote their similarity under a Gaussian 
distribution. Employing a Student t-distribution with one degree of freedom in 
the low-dimensional space then provides a second probability $q_{kk'} \in \mathbb{I}$. Hence, t-SNE 
aims at minimizing the following sum (or Kullback-Leibler divergence) [vdMH08] 

$$\sum_{k \in \mathcal{K}} \sum_{k' \in \mathcal{K}} p_{kk'} \log\left(\frac{p_{kk'}}{q_{kk'}}\right) \qquad (32)$$


using gradient descent. We first perform dimensionality reduction via PCA, which 
is then followed by t-SNE. In our experiments, we observed that this combination 
of methods improves the effectiveness of mapping anomaly predictions onto a two- 
dimensional embedding space. Here, the embedding ideally creates neighborhoods 
of visually related anomalies. 
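
To make the objective in (32) more tangible, the following sketch evaluates it for a given embedding. It is a simplification and not the full t-SNE algorithm: the similarities $p_{kk'}$ are computed with one fixed Gaussian bandwidth `sigma` and normalized jointly, whereas t-SNE calibrates per-point bandwidths via a perplexity parameter, and no gradient descent is performed. All function names are assumptions.

```python
import math


def _squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def _normalize(weights):
    """Scale a weight matrix so that all entries sum to one."""
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def gaussian_similarities(features, sigma=1.0):
    """Similarities p in the feature space (fixed bandwidth, a simplifying assumption)."""
    n = len(features)
    return _normalize([
        [0.0 if i == j else
         math.exp(-_squared_distance(features[i], features[j]) / (2.0 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ])


def student_t_similarities(embedding):
    """Similarities q in the embedding space (Student t, one degree of freedom)."""
    n = len(embedding)
    return _normalize([
        [0.0 if i == j else 1.0 / (1.0 + _squared_distance(embedding[i], embedding[j]))
         for j in range(n)]
        for i in range(n)
    ])


def kl_divergence(p, q):
    """Sum of p * log(p / q) over all pairs, cf. (32); terms with p = 0 contribute 0."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0.0
    )


# Two tight groups in a 3-D feature space (toy data).
features = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
p = gaussian_similarities(features)
good = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # neighborhoods preserved
bad = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]   # neighborhoods mixed up
print(kl_divergence(p, student_t_similarities(good)))
print(kl_divergence(p, student_t_similarities(bad)))
```

An embedding that keeps neighboring feature vectors close yields a lower divergence than one that separates them, which is the behavior the gradient descent exploits.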


3. Novelty segmentation: If anomalies of the same category are detected more 
frequently, they are expected to form a bigger cluster in the embedding space. Those 
clusters can be identified by employing algorithms such as density-based spatial 
clustering of applications with noise (DBSCAN) [EKSX96]. This algorithm supports 
the idea of non-supervision since it does not require any information about the potential 
anomaly data, such as the number of clusters. Moreover, DBSCAN divides 
data points into core points, border points, and noise, depending on the size of the 
neighborhood $\varepsilon \in [0, \infty)$ and the minimal number of a core point's neighbors $\delta \in \mathbb{N}$. 

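More precisely, let $\tilde{g}^{(k)} \in \mathbb{R}^2$ denote the two-dimensional representation of $x^{(k)}$. 
Then, $\tilde{g}^{(k)}$ is considered as a core point if the corresponding point-wise density 
$\rho(\tilde{g}^{(k)}) := |\{\tilde{g}^{(k')} : \|\tilde{g}^{(k)} - \tilde{g}^{(k')}\| \leq \varepsilon, \ k' \in \mathcal{K}\}| \geq \delta$, i.e., the $\varepsilon$-neighborhood of $\tilde{g}^{(k)}$ 
contains at least $\delta$ points including itself. We denote the neighborhood of a core point 
$\tilde{g}^{(k)}$, which corresponds to a component $k \in \mathcal{K}$, as $B_k := \{\tilde{g}^{(k')} : \|\tilde{g}^{(k)} - \tilde{g}^{(k')}\| \leq 
\varepsilon, \ k' \in \mathcal{K}\}$. If $\tilde{g}^{(k)}$ is not a core point but belongs to a core point's neighborhood, we 
call it a border point. Otherwise, i.e., if $\tilde{g}^{(k)}$ is neither a core point nor within a core 
point's neighborhood, we call it noise. 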
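
The distinction between core points, border points, and noise can be sketched in a few lines. This is a compact illustration with a quadratic-time neighborhood search and not an optimized implementation; the parameters `eps` and `min_pts` stand for the neighborhood size and the minimal number of neighbors of a core point, and the function name is an assumption.

```python
import math


def dbscan(points, eps, min_pts):
    """Return (kinds, cluster_ids): kinds[k] is 'core', 'border' or 'noise';
    cluster_ids[k] is a cluster index or None for noise."""
    n = len(points)
    # The eps-neighborhood of a point includes the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    cluster_ids = [None] * n
    current = 0
    for i in range(n):
        if not is_core[i] or cluster_ids[i] is not None:
            continue
        # Grow a new cluster from an unassigned core point.
        cluster_ids[i] = current
        stack = [i]
        while stack:
            k = stack.pop()
            for j in neighbors[k]:
                if cluster_ids[j] is None:
                    cluster_ids[j] = current
                    if is_core[j]:  # only core points propagate the cluster
                        stack.append(j)
        current += 1
    kinds = [
        "core" if is_core[i] else "border" if cluster_ids[i] is not None else "noise"
        for i in range(n)
    ]
    return kinds, cluster_ids


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(dbscan(points, eps=1.0, min_pts=3))
```

In this toy example, the four corner points of the unit square are core points, the point (2, 1) is a border point attached to the core point (1, 1), and the isolated point (10, 10) is noise.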

Finally, a cluster $C_j \subseteq \mathcal{K}$, $j \in \mathcal{J} := \{1, \ldots, J\}$, of components is formed by merging 
overlapping neighborhoods $B_k$, yielding $J \in \mathbb{N}$ clusters in total. In other words, 
clusters are formed from connected core points and their neighborhoods' border 
points. Given $\rho(\tilde{g}^{(k)})$, we can determine the cluster density of $C_j$ either 
as the maximum $\max_{k \in C_j} \rho(\tilde{g}^{(k)})$ or as the average $\frac{1}{|C_j|} \sum_{k \in C_j} \rho(\tilde{g}^{(k)})$. 


The cluster $C^* \subseteq \mathcal{K}$, which is the cluster of highest density given a sufficient cluster 
size, is then selected to be further processed. To this end, let us consider the predicted 
segmentation mask $m = (m_i)_{i \in \mathcal{I}}$ obtained from $y = F(x)$, where $m_i = \arg\max_{s \in \mathcal{S}} y_{i,s}$, $i \in \mathcal{I}$. The 
pseudo labels $\hat{y} = (\hat{y}_i)_{i \in \mathcal{I}}$ for the originally unlabeled $x$ are then obtained by setting 
$\hat{y}_i = S + 1$ if pixel location $i$ belongs to a component $k \in C^*$, and $\hat{y}_i = m_i$ otherwise. 
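
A minimal sketch of the cluster selection and the pseudo-label assignment is given below. The data layout (dictionaries of component ids, pixel coordinate lists), the parameter `min_size` for the "sufficient cluster size", and the choice of the average density are assumptions made for illustration.

```python
def select_densest_cluster(clusters, density, min_size):
    """Pick the cluster with the highest average density among sufficiently large ones.

    clusters: dict mapping a cluster id to a list of component ids
    density:  dict mapping a component id to its point-wise density
    """
    candidates = {j: comps for j, comps in clusters.items() if len(comps) >= min_size}
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda j: sum(density[k] for k in candidates[j]) / len(candidates[j]),
    )


def pseudo_labels(mask, component_pixels, novel_id):
    """Copy the predicted mask and overwrite the pixels of selected components."""
    labels = [row[:] for row in mask]
    for pixels in component_pixels:
        for y, x in pixels:
            labels[y][x] = novel_id
    return labels


# Toy example with S = 3 known classes, so the novel class receives id 4.
mask = [[1, 2, 2], [1, 2, 3]]
print(pseudo_labels(mask, component_pixels=[[(0, 1), (1, 1)]], novel_id=4))
```

All pixels outside the selected components keep the class predicted by the initial model, so the pseudo labels differ from the predicted mask only on the novel components.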


5.2 Class-Incremental Learning 


Let $\hat{\mathcal{Y}}$ denote the set of pseudo labels. Then the training data for some novel class $S + 1$ 
can be represented by $\mathcal{D}^{S+1} \subseteq \mathcal{D}^{\mathrm{novel}} \times \hat{\mathcal{Y}}$, where $\mathcal{D}^{\mathrm{novel}}$ denotes the set of previously 
unseen images containing novel classes. By extending the semantic segmentation 
network $F$ to $F^+ : \mathbb{I}^{H \times W \times 3} \to \mathbb{R}^{H \times W \times (S+1)}$ and retraining $F^+$ on $\mathcal{D}^{S+1}$, we perform 
incremental learning to add a novel and previously unknown class to the semantic 
space of $F$. 


Regularization: Knowledge distillation is a subcategory of regularization strategies 
aiming to mitigate catastrophic forgetting, i.e., these strategies try to limit the 
performance loss on the previously learned classes $\mathcal{S} = \{1, \ldots, S\}$ while learning the 
additional class $S + 1$. In [MZ19], the authors adapted incremental learning 
techniques to the task of semantic segmentation. Among others, they introduced the 
overall objective 


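$$J(x, \hat{y}, F) = (1 - \lambda) J^{\mathrm{CE}}(F^+(x), \hat{y}) + \lambda J^{\mathrm{D}}(F^+(x), F(x)), \quad \lambda \in \mathbb{I}, \qquad (33)$$


where $(x, \hat{y}) \in \mathcal{D}^{S+1}$. Here, $J^{\mathrm{CE}}$ denotes the common cross-entropy loss over the 
enlarged set of class indices $\mathcal{S}^+ := \{1, \ldots, S + 1\}$ and $J^{\mathrm{D}}$ the distillation loss. The 
latter loss is defined as 


$$J^{\mathrm{D}}(F^+(x), F(x)) := -\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \sum_{s \in \mathcal{S}} \mathrm{softmax}_s(y_i) \log(\mathrm{softmax}_s(y_i^+)) \qquad (34)$$


with $y = F(x)$ and $y^+ = F^+(x)$. Knowledge distillation can be further improved by 
freezing the weights of the encoder part of $F^+$ during the training procedure [MZ19]. 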
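
The interplay of the two loss terms in (33) and (34) can be sketched for a handful of pixels as follows. The function names and the 0-based class indices are assumptions; in particular, the sketch assumes that the softmax of the extended model is taken over all S + 1 classes while the sum in the distillation loss runs over the S old classes only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over the class scores of one pixel."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_norm for z in logits]


def cross_entropy(new_logits, labels):
    """Mean negative log-likelihood; labels are 0-based indices over S + 1 classes."""
    losses = [-log_softmax(y_plus)[label] for y_plus, label in zip(new_logits, labels)]
    return sum(losses) / len(losses)


def distillation(new_logits, old_logits):
    """Cf. (34): old-model probabilities weight the new model's log-probabilities."""
    total = 0.0
    for y_plus, y in zip(new_logits, old_logits):
        old_probs = [math.exp(v) for v in log_softmax(y)]  # softmax over S classes
        new_log_probs = log_softmax(y_plus)  # softmax over S + 1 classes (assumption)
        total -= sum(p * new_log_probs[s] for s, p in enumerate(old_probs))
    return total / len(old_logits)


def overall_objective(new_logits, old_logits, labels, lam=0.5):
    """Cf. (33): convex combination of cross-entropy and distillation loss."""
    return ((1.0 - lam) * cross_entropy(new_logits, labels)
            + lam * distillation(new_logits, old_logits))


# One pixel, S = 2 old classes, pseudo label pointing to the novel class (index 2).
print(overall_objective([[0.0, 0.0, 0.0]], [[0.0, 0.0]], [2], lam=0.5))
```

Setting `lam` to 0 recovers plain retraining on the pseudo labels, while larger values keep the extended model's outputs on the old classes closer to those of the initial model.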


Rehearsal: If the original training data $\mathcal{D}^{\mathrm{train}} \subseteq \mathcal{X} \times \mathcal{Y}$ of network $F$ is available, 
in incremental learning such data is usually re-integrated into the training set of 
the extended network $F^+$, i.e., the training samples are drawn from $\mathcal{D}^{\mathrm{train}} \cup \mathcal{D}^{S+1}$. 
To save computational costs of training and to balance the amount of old and new 
training data, established methods, e.g., [Rob95], only use a subset of $\mathcal{D}^{\mathrm{train}}$. This 
subset is typically obtained by randomly sampling a set from $\mathcal{D}^{\mathrm{train}}$ that matches the 
size $|\mathcal{D}^{S+1}|$. 

In combination with knowledge distillation, rehearsal strategies can be employed 
to mitigate a loss of performance on classes that are related to the novel class. This 
issue may arise e.g., through visual similarity such as between classes like bus and 
train, or due to class affiliation as in the case of bicycle and rider. Relevant classes 
can be identified by their frequency of being predicted on the relabeled pixels, i.e., 


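$$\nu_s^{\mathrm{tot}} := \sum_{(x, \hat{y}) \in \mathcal{D}^{S+1}} |\{i \in \mathcal{I} \mid m_i = s \wedge \hat{y}_i = S + 1\}| \quad \forall s \in \mathcal{S}, \qquad (35)$$


and hence 


$$\nu_s^{\mathrm{rel}} := \frac{\nu_s^{\mathrm{tot}}}{\sum_{s' \in \mathcal{S}} \nu_{s'}^{\mathrm{tot}}} \quad \forall s \in \mathcal{S}. \qquad (36)$$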


The subset of $\mathcal{D}^{\mathrm{train}}$ is then randomly sampled under the constraint that there are at 
least $\nu_s^{\mathrm{rel}} |\mathcal{D}^{S+1}|$ images containing the class $s$ for all $s \in \mathcal{S}$. 


Fig. 4 Predicted semantic segmentation mask of the initial model’s prediction (top, panel “Segmentation 
prediction of the initial model”) and corresponding segment-wise quality estimation (bottom, panel 
“Prediction quality rating”) for one example from the Cityscapes test split. 
Green color indicates a high segment-wise IoU, red color indicates a low one 
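
The class frequencies in (35) and (36) and the resulting rehearsal constraint of Sect. 5.2 can be sketched as follows. Masks and pseudo labels are given as flat lists of class ids per pixel; the function names and the rounding up to whole images are assumptions for illustration.

```python
from collections import Counter


def class_frequencies(samples, novel_id):
    """Cf. (35): count how often each old class is predicted on relabeled pixels.

    samples: iterable of (mask, pseudo_label) pairs, both flat lists of class ids
    """
    totals = Counter()
    for mask, pseudo in samples:
        for m_i, y_i in zip(mask, pseudo):
            if y_i == novel_id:
                totals[m_i] += 1
    return totals


def relative_frequencies(totals):
    """Cf. (36): normalize the counts so that they sum to one."""
    n = sum(totals.values())
    return {s: count / n for s, count in totals.items()}


def rehearsal_quotas(totals, n_new_samples):
    """Minimum number of rehearsal images per old class, rounded up (assumption)."""
    n = sum(totals.values())
    # Integer ceiling division avoids floating-point rounding issues.
    return {s: -(-count * n_new_samples // n) for s, count in totals.items()}


# Toy example with novel class id 4 and two pseudo-labeled images of four pixels each.
samples = [([1, 1, 2, 3], [4, 4, 4, 3]), ([2, 2, 1, 1], [4, 2, 1, 4])]
totals = class_frequencies(samples, novel_id=4)
print(relative_frequencies(totals), rehearsal_quotas(totals, n_new_samples=10))
```

The quotas only state lower bounds per class; how the remaining rehearsal images are drawn is left to the random sampling described above.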


5.3 Experiments and Evaluation 


In the following experiments, we will employ a DeepLabV3+ [CZP+18] model with 
an underlying WideResNet38 [ZSR+19] backbone for semantic segmentation. 
This network is initially trained on a set of 17 classes, which we will extend by a 
novel class. The already trained classes are the Cityscapes training classes except 
pedestrian and rider, i.e., we exclude any human in the training process of our initial 
semantic segmentation network F. 

The initial model was trained on the Cityscapes [COR+16] training data. For the 
incremental learning process, we use a portion of those data and combine them with 
our generated disjoint training set $\mathcal{D}^{S+1}$ containing previously unseen images and 
pseudo labels on novel objects. Here, the images from $\mathcal{D}^{S+1}$ are drawn from the 
Cityscapes test data. For evaluation purposes, we use the Cityscapes validation data. 
Hence, during the incremental learning process only known objects are presented to 
the model, except humans and a few instances, such as the ego-car or mountains in 
an image background, belonging to the Cityscapes void category. 

Fig. 5 Relative frequency of old classes being predicted by the initial model on pixels that are 
assigned to the novel class. Thus, the subset of $\mathcal{D}^{\mathrm{train}}$ included in the retraining should mainly 
involve bicycles, motorcycles, and cars 


We use the idea of meta classification, similarly as introduced in Sect. 4.3, to rate 
the prediction quality of predicted semantic segmentation masks. Here, the meta task 
is to estimate the segment-wise IoU first, see Fig. 4, on which we apply thresholding 
(at $t = 0.5$) to determine potential anomalies, cf. [ORF20]. We employ gradient 
boosting as meta model, which achieves a coefficient of determination of $R^2 = 
82.51\%$ in estimating the segment-wise IoU on the Cityscapes validation split. 

In accordance with Sect. 2 and as already observed in Sect. 4.3, the softmax entropy 
is again one of the main metrics included in the meta model to identify anomalous 
predictions. Thus, the entropy proves to have a great impact on meta classification 
performance, which has similarly been observed in [CRH+20, CRG21]. 

Given anomaly segmentation masks, we perform image embedding using the 
encoder of the image classification network DenseNet201 [HLMW17], which is 
pretrained on ImageNet [DDS+09]. Next, we reduce the dimensionality of the resulting 
feature vectors to 50 via PCA and further to 2 by applying t-SNE. In [ORF20], a 
qualitative and quantitative evaluation of different embedding approaches is provided. 
Note that t-SNE is non-deterministic, i.e., we obtain slightly different embedding 
spaces for different runs. In our experiment, employing DBSCAN with parameters 
$\varepsilon = 2.5$ and $\delta = 15$ produces a human cluster including 91 components from 
76 different images. The most frequently predicted classes of these components are 
car, motorcycle, and bicycle with $\nu_{\mathrm{car}}^{\mathrm{rel}} = 24.84\%$, $\nu_{\mathrm{motorcycle}}^{\mathrm{rel}} = 26.69\%$, and $\nu_{\mathrm{bicycle}}^{\mathrm{rel}} = 33.53\%$, 
respectively, see Fig. 5. 

We train the extended model $F^+$ as described in Sect. 5.2 for 70 epochs, weighting 
the loss functions in (33) equally, i.e., $\lambda = 0.5$. The extended model shows the ability 
to retain its initial knowledge by achieving an mIoU score of 68.24% on the old 
classes when evaluating on the Cityscapes validation data. This yields a marginal 
loss of only 0.39% compared to the initial model $F$. At the same time, $F^+$ predicts 
the novel human class with a class IoU of 41.42%, without a single annotated human 
instance in the training data $\mathcal{D}^{S+1}$. A visual example of our unsupervised novelty 
segmentation approach is provided in Fig. 6; more details on the numerical evaluation 
are given in Table 3. 

[Figure panels: Prediction of initial model, Novel class in validation image, Prediction of extended model] 

Fig. 6 Comparison of the predicted semantic segmentation masks before (top) and after (bottom) 
adapting the model to the novel human class (orange) for one example of the Cityscapes validation 
split (middle). Here, the novel components are highlighted in orange, green contours indicate the 
ground truth annotation of the novelty 

Fig. 7 A comparison of anomaly scores obtained by meta classification (left) and entropy (right) 
on an image from RoadObstacle21. The dog on the image is the anomaly of interest (indicated by 
green contours), which would have been overlooked by meta classification but entirely detected by 
the entropy 


R. Chan et al. 




Table 3 Evaluation of the Cityscapes validation split before and after incremental learning of the 
novelty human (highlighted in gray) with knowledge distillation and rehearsal. The classes 
pedestrian and rider are aggregated to the novel class human. All other classes are treated as 
background, i.e., they are ignored during training, regarding the data from $\mathcal{D}^{S+1}$ 

[Table body: per-class columns followed by the columns mean excl. human and mean incl. human; values not legible] 




5.4 Outlook on Improving Unsupervised Learning of Novel Classes 


In the preliminary experiments presented in this section, we demonstrated that a 
semantic segmentation network can be extended by a novel class in an unsupervised 
fashion. As a basis to start, our unsupervised learning approach requires anomaly 
segmentation masks. Currently, these are obtained by meta classification [RCH+20], 
which is, however, not a method specifically tailored for the task of anomaly seg- 
mentation. In other words, the obtained masks are possibly inaccurate. To be even 
more precise on the limitation of plain meta classification, this method is only able 
to find anomalies when the segmentation model produces a (false positive) object 
prediction on those anomalies. By design, meta classifiers cannot find overlooked 
instances, e.g., obstacles on the road which also have been classified as road. As an 
illustration of this issue, we refer to Fig. 7. 

With several methods at hand, e.g., those introduced in Sect. 4.1, an obvious 
direction for future work is to replace the underlying anomaly segmentation method 
by a more sophisticated one. In particular, given the decent performance of our 
unsupervised learning approach relying only on meta classification, and given that the 
entropy measure is a highly beneficial metric for meta classification, combining entropy 
maximization and meta classification is a promising approach to improve the presented 
novelty training approach. 


Conclusions 


Semantic segmentation as a supervised learning task is typically performed by models 
that operate on a given set containing a fixed number of classes. This is in clear 
contrast to the open world scenarios in which practitioners contemplate the usage of 
segmentation models. There are important capabilities that standard segmentation 
models do not exhibit. Among them is the capability to know when they face an 
object of a class they have not learned — i.e., to perform anomaly segmentation — 
as well as the capability to realize that similar objects, presumably of the same (yet 
unknown) class, appear frequently and should be learned either as a new class or be 
attributed to an existing one. In this chapter, we have seen first promising results for 
two tasks, for anomaly segmentation as well as for the detection and unsupervised 
learning of new classes. 

For anomaly segmentation, we considered a number of generic baseline methods 
stemming from image classification as well as some recent anomaly segmentation 
methods. Since the latter clearly outperform the former, this stresses the need for 
the development of methods specifically designed for anomaly segmentation. We 
have demonstrated with our entropy maximization method, empirically as well as 
theoretically, that good proxies in combination with training on anomaly examples for 
high entropy are key to equip standard semantic segmentation models with anomaly 
segmentation capabilities. Particularly on the challenging RoadObstacle21 dataset 
with diverse street scenarios, entropy maximization yields great performance which 
is not reached by any other method so far. While there exists a moderate number of 
datasets for anomaly segmentation, there is clearly still a need for additional datasets. 
The number of possible unknown object classes not covered by these datasets is 
evidently enormous. Furthermore, the vast variety of possible environmental 
conditions and further domain shifts that may occur, possibly also in combination 
with unknown objects, continuously demands exploration. 

For detection and unsupervised learning of new classes, we demonstrated in pre- 
liminary experiments that a combination of well-established dimensionality reduc- 
tion and clustering methods along with the advanced uncertainty quantification 
method for semantic segmentation called MetaSeg is well able to detect unknown 
classes of which objects appear relatively frequently in a given test set. Indeed, 
MetaSeg can also be used to define segmentation proposals for pseudo ground-truth 
of new classes, which can also be learned incrementally by the segmentation model. 
For the considered scenario of subsequently learning humans within the Cityscapes 
dataset, this approach yields an IoU of 41.42% on the novel class without losing per- 
formance on the original classes. The proposed methodology may help to incorporate 
new classes into existing models with low human labeling effort. The necessity for 
this will occur repeatedly in the future. One example is the electric scooters that recently 
appeared in several metropolitan areas across the globe. This is an example of a global 
phenomenon. However, local phenomena, such as boat trailers at the coast, could also 
be of interest. Such classes can be initially incorporated into an existing model using 
our methodology. Afterwards, the initial performance could be further improved with 
active learning approaches, such as [CRGR21], still requiring only a small amount 
of human labeling effort. It is also an open question to which extent the proposed 
method can be used iteratively to improve the performance on a new class. For 
this track of research as well, the lack of data for pursuing that task is a limiting factor as of 
now. 


Evaluating Mixture-of-Experts 
Architectures for Network Aggregation 


Svetlana Pavlitskaya, Christian Hubschneider, and Michael Weber 


Abstract The mixture-of-experts (MoE) architecture is an approach to aggregate 
several expert components via an additional gating module, which learns to predict 
the most suitable distribution of the experts' outputs for each input. An MoE thus 
not only relies on redundancy for increased robustness; we also demonstrate how 
this architecture can provide additional interpretability, while retaining performance 
similar to a standalone network. As an example, we train expert networks to perform 
semantic segmentation of traffic scenes and combine them into an MoE with an 
additional gating network. Our experiments with two different expert model 
architectures (FRRN and DeepLabv3+) reveal that the MoE is able to reach, and for 
certain data subsets even surpass, the baseline performance and also outperforms a 
simple aggregation via ensembling. A further advantage of an MoE is the increased 
interpretability: a comparison of pixel-wise predictions of the whole MoE model 
and the participating experts helps to identify regions of high uncertainty in an input. 


1 Introduction 


An intelligent combination of redundant, complementary functional modules is one 
of the strategies to enhance the accuracy of an overall system as well as its robustness 
to unseen data. Existing aggregation methods either combine neural network outputs, 
as is the case with ensembles, or include combinations at the structural level, generally 
referred to as fusion, including multi-stream and multi-head neural networks. 
Ensembles of identical models are a typical approach to gain superior accuracy by 
combining several distinct modules. Combination strategies include (un-)weighted 
averaging, majority voting, stacking, or creating a committee with a combination 
rule. Ensembles are conceptually simple, deterministic combination rules that are 
easy to implement and interpret. The approach also easily scales to a multitude 
of modules. Deep ensembles, i.e., ensembles of deep neural networks, trained with 
different random initialization parameters, have become one of the de facto standards 
for achieving better test accuracy and model robustness [LPB17]. 

However, since combination rules in ensembling are static, no sophisticated com- 
binations are possible. In the case of deep learning models, each model usually needs 
to be trained separately. In addition, the overall results can only be calculated after 
the outputs of all experts are calculated. 

Fusion is on the opposite side in the spectrum of model combination approaches. 
Information from several models is combined either starting from early layers (early 
fusion), over multiple network layers (slow fusion), or by merging only the last 
layers (late fusion) [KTS+14]. Regardless of the fusion type, these methods usually 
involve a complex implementation and allow for no or only little insights into the 
decision-making process itself. 

The mixture-of-experts architecture, first proposed by Jacobs et al. [JJNH91], takes 
a middle path and combines the simplicity and interpretability of the result with the 
possibility to form sophisticated network architectures. Similar to an ensemble, a 
mixture of experts combines several separate modules named experts. However, an 
additional, trainable gating component weights the expert outputs on a per-input 
basis (see Fig. 1). 

Fig. 1 In an ensemble (a), the combination rule is deterministic, so that the distribution of expert 
weights is fixed for all inputs. A mixture of experts (MoE) (b) contains an additional gate module, 
which predicts the distribution of the expert weights for each input 

Here, we want to introduce the general concept more thoroughly and extend 
the evaluation from our previous work [PHW+20] to demonstrate the abilities and 
possibilities of MoE architectures to perform the task at hand (semantic segmentation 
of traffic scenes) and to provide post-hoc explainability via a comparison of internal 
decisions. 

The contributions on top of our previous work [PHW+20] are as follows: The 
experiments are extended to include and evaluate a second expert architecture 
(DeepLabv3+). We also provide new results for the complete label set of the A2D2 
dataset. Moreover, a deeper analysis of the impact of the chosen feature extraction 
layer for the FRRN architecture is performed, whereas in previous work only the 
last two pooling layers were considered. Then, the expert weights predicted by different 
gate architectures and the benefit of an additional convolutional layer are analyzed 
and discussed. A more advanced MoE architecture is considered and evaluated, in 
which the encoder layers of the experts are shared. Finally, an additional comparison 
of the proposed MoE architecture to an ensemble of experts is made. 


2 Related Works 


Mixtures of experts have primarily been applied to divide-and-conquer tasks, 
where the input space is split into disjoint subsets, so that each expert is specialized 
on a separate subtask and a gating component learns a distribution over these experts. 

One of the application areas of MoEs studied in current publications is fine-grained 
image classification, where the task is to discriminate between classes in 
a sub-category of objects, such as the bird classification task [GBM+16]. Ahmed 
et al. [ABT16] propose a tree-like structure of experts, where the first expert (a 
generalist) learns to discriminate between coarse groups of classes, whereas further 
experts (specialists) learn to categorize within a specific group of objects. This way, 
at inference time, the generalist model acts as a gate by selecting the correct specialist 
for each input. 

The mixture-of-experts approach has also been extensively applied to tackle multi- 
sensor fusion. To achieve this, each expert is trained on a certain sensor modality and 
a gating component combines outputs from experts for different input types. Mees et 
al. [MEB+16] apply a mixture of three experts for different modalities (RGB, depth, 
and optical flow input) to the task of object recognition in mixed indoor and outdoor 
environments. An analysis of the weights predicted by a gate shows that the MoE 
adapts to the changing environment by choosing the most appropriate combination 
of expert weights—e.g., a depth expert gets a higher weight for darker image frames. 
A similar setup is used in [VDB16], where it is applied to the task of semantic 
segmentation. Similar to the previous setting, the MoE demonstrates the ability to 
compensate for a failed sensor via an appropriate weight combination. 
Furthermore, in [FC19] and [FC20] the MoE approach is used to handle 
multi-modal end-to-end learning for autonomous driving. 

A stacked MoE model, consisting of two MoEs, each containing its own experts 
and corresponding gates, is proposed in [ERS14]. Although evaluated only on 
MNIST, this work was an important step toward studying large models, which apply 
conditional execution of a certain part of the whole architecture for each image. The 
idea was further developed in an influential work on sparse MoEs for language mod- 
eling and machine translation [SMM+17], where a generic MoE layer is proposed. 
Similar to a general MoE, the embedded MoE layer contains several expert networks 
as well as a gating component, but it forms only a part of the overall architecture 
trained in an end-to-end manner. During the inference, the experts with non-zero 
weights are activated in each MoE layer, thus enabling conditional computation on 
a per-input basis. 

The mixture-of-experts approach is also tightly connected to the selective execution 
or conditional computation paradigm. Normally, a straightforward way to 
boost the performance of a deep learning model, given a sufficiently large dataset, 
is to increase its capacity, i.e., the number of parameters. This, however, leads to a 
quadratic increase in the computational costs for training [SMM+17]. One of the 
approaches to overcome this is to execute only a certain part of the network in an 
input-dependent manner. In this area, primarily approaches such as dynamic deep 
networks [LD18] have been used. A large model with hundreds of MoE layers with 
dynamic routing is explored in [WYD+19]. Recent work [LLX+21] continues this 
line of research and applies conditional computation and automatic sharding to a 
large, sparsely gated MoE combined with transformers for neural machine transla- 
tion. 

In contrast to existing work, we treat the MoE not only as a method to achieve 
higher accuracy but also as a post-hoc interpretability approach. The described 
architecture is closer to earlier MoE methods [MEB+16] and less complex than the stacked 
and embedded MoE. 


3 Methods 


In this section, we introduce the mixture-of-experts approach formally and show 
how discrepancy masks arise from a comparison of the intermediate predictions 
within an MoE and how they can serve as a means to locate regions of high 
uncertainty in an input. 


3.1 MoE Architecture 


The mixture-of-experts architecture presented here originates from our previous 
work [PHW+20] and is partially inspired by Valada et al. [VDB16]. In this setting, 
the MoE consists of two experts, each trained on a separate subset of data, as 
well as a small gating network. Deviating from the standard MoE approach, we reuse 
existing feature maps, extracted from the encoder part of the experts, as inputs to a 
gate. These feature maps are much smaller and can thus speed up both training and 
inference (see Fig. 2). 

Fig. 2 Architecture of an MoE with separate feature extractors. The gate module receives a 
concatenation of feature maps $f_\ell^{\mathrm{expert}_1}$ and $f_\ell^{\mathrm{expert}_2}$ extracted from the layer $\ell$ in an 
encoder part of each of the experts. The gate then predicts the expert weights $w^{\mathrm{expert}_1}$ and 
$w^{\mathrm{expert}_2}$ 

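Formally, an MoE $F^{\mathrm{MoE}}$ consists of $n$ experts $\{F^{\mathrm{expert}_\nu}\}_{\nu=1,\dots,n}$ and a gate $F^{\mathrm{gate}}$. 
Assume each expert $F^{\mathrm{expert}_\nu}$ follows an encoder-decoder architecture. We select a 
certain feature extraction layer $\ell$ from the encoder part of each expert. The gate $F^{\mathrm{gate}}$ 
receives a concatenation of feature maps $\{f_\ell^{\mathrm{expert}_\nu}\}_{\nu=1,\dots,n}$ from layer $\ell$ of each expert and 
computes expert weights for an input $x \in \mathbb{I}^{H \times W \times C}$ with height $H$, width $W$, number 
of channels $C = 3$, and $\mathbb{I} = [0, 1]$ via 

$$\left(w^{\mathrm{expert}_1}, \dots, w^{\mathrm{expert}_n}\right) = F^{\mathrm{gate}}\left(\left\{f_\ell^{\mathrm{expert}_\nu}\right\}_{\nu=1,\dots,n}\right). \tag{1}$$

To calculate the overall MoE prediction, the weighted sum of logits $y^{\mathrm{expert}_\nu}$ of all 
experts is computed as 

$$y^{\mathrm{MoE}} = F^{\mathrm{MoE}}(x) = \sum_{\nu=1}^{n} w^{\mathrm{expert}_\nu} \cdot y^{\mathrm{expert}_\nu}. \tag{2}$$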


The resulting approach can thus be interpreted as an extension of ensembles, 
where weighting of the outputs of individual models is not predefined, but chosen 
according to the outputs of a trainable gate. 
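
As a rough illustration of Eqs. (1) and (2), the following NumPy sketch mixes the logits of two experts with gate-predicted weights. It is not the implementation used in the experiments: the function names, the tensor layout (experts first, classes last), and the toy gate (global average pooling, one linear layer, softmax) are assumptions made only for this example, and the random arrays stand in for real feature maps and logits.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_gate(feature_maps, gate_matrix):
    """Hypothetical stand-in for F^gate in Eq. (1).

    feature_maps: list of n arrays of shape (h, w, c), one per expert.
    gate_matrix:  array of shape (n, n * c), an assumed linear layer.
    Returns n expert weights that sum to one.
    """
    concatenated = np.concatenate(feature_maps, axis=-1)  # (h, w, n * c)
    pooled = concatenated.mean(axis=(0, 1))               # (n * c,)
    return softmax(gate_matrix @ pooled)                  # (n,)

def mix_expert_logits(expert_logits, weights):
    """Eq. (2): weighted sum of the expert logits.

    expert_logits: array of shape (n, H, W, m); weights: array of shape (n,).
    """
    return np.tensordot(weights, expert_logits, axes=1)   # (H, W, m)

rng = np.random.default_rng(0)
feature_maps = [rng.random((30, 40, 16)) for _ in range(2)]
gate_matrix = rng.normal(size=(2, 32))
expert_logits = rng.normal(size=(2, 4, 6, 38))  # tiny H x W for illustration

weights = toy_gate(feature_maps, gate_matrix)
y_moe = mix_expert_logits(expert_logits, weights)
print(weights, y_moe.shape)
```

Fixing the weights to constants that do not depend on the input recovers a weighted-averaging ensemble, which is the sense in which the MoE extends ensembling.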

To further increase the capacity of the model, this architecture can be enhanced 
in a late fusion manner. This can be achieved by appending an additional convolutional 
layer which processes the weighted sum of the expert outputs before the final 
prediction $F^{\mathrm{MoE}}(x)$ is computed (cf. Fig. 2). 

Fig. 3 Architecture of an MoE with a shared feature extractor. In contrast to the architecture in 
Fig. 2, the gate module receives a single feature map $f_\ell^{\mathrm{enc}}$ from the shared feature extractor as 
an input 


Since convolutional neural networks tend to learn similar generic features in the 
lower layers, these layers can be shared across experts to reduce the overall number 
of parameters (see Fig. 3). In this case, each of the expert decoders and the gate 
receive the same pre-processed input $f_\ell^{\mathrm{enc}}$. 


3.2 Disagreements Within an MoE 


The multi-component nature of an MoE allows analyzing the final predictions via 
comparison of several intermediate predictions, which are either taken into consideration 
or rejected by the overall model. The final decision of an MoE is made on 
a per-pixel basis and consists of weighting individual decisions of the participating 
experts. 

By comparing the predictions of the single experts $y^{\mathrm{expert}_\nu}$ with the prediction of the 
MoE $y^{\mathrm{MoE}}$, we are able to decide for each pixel whether its classification presents a 
difficulty to the overall model. 

We consider the following three cases that could arise. The discrepancy mask 
$y = (y_{h,w}) \in \mathbb{B}^{H \times W}$ has the same height $H$ and width $W$ as the MoE prediction 
$y^{\mathrm{MoE}}$, with $h$ and $w$ being row and column index, and $\mathbb{B} = \{0, 1\}$. For each case, a 
corresponding discrepancy mask for an input image is calculated as follows (note 
that the symbol $\wedge$ represents the logical conjunction operator): 


• Perfect case: Here, the prediction by all experts and the MoE is the same. The 
elements of the discrepancy mask $y^{\mathrm{perfect}}$ for the perfect case are computed as 
follows: 

$$y^{\mathrm{perfect}}_{h,w} = \bigwedge_{\nu=1}^{n} \Big( \big(\arg\max(y^{\mathrm{expert}_\nu})\big)_{h,w} = \big(\arg\max(y^{\mathrm{MoE}})\big)_{h,w} \Big) \tag{3}$$

• Normal case: Here, the prediction by one of the experts and the MoE is the same. 
This case can further be split into normal case 1, where the same prediction is 
output by expert 1 and the MoE, normal case 2 for expert 2 and the MoE, etc. The 
elements of the discrepancy mask $y^{\mathrm{normal}_1}$ for the normal case 1 are computed as 
follows: 

$$y^{\mathrm{normal}_1}_{h,w} = \Big( \big(\arg\max(y^{\mathrm{expert}_1})\big)_{h,w} = \big(\arg\max(y^{\mathrm{MoE}})\big)_{h,w} \Big) \wedge \bigwedge_{\nu=2}^{n} \Big( \big(\arg\max(y^{\mathrm{expert}_\nu})\big)_{h,w} \neq \big(\arg\max(y^{\mathrm{MoE}})\big)_{h,w} \Big) \tag{4}$$

• Critical case: The MoE and experts output different predictions. The elements of 
the discrepancy mask $y^{\mathrm{critical}}$ for the critical case are computed as follows: 

$$y^{\mathrm{critical}}_{h,w} = \bigwedge_{\nu=1}^{n} \Big( \big(\arg\max(y^{\mathrm{expert}_\nu})\big)_{h,w} \neq \big(\arg\max(y^{\mathrm{MoE}})\big)_{h,w} \Big) \tag{5}$$
v=1 


The critical case, where the MoE outputs a prediction that was not made 
by any of the experts, helps to identify image regions that are challenging for the 
current model and could, for example, require more relevant examples in the training 
data. 
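
The three cases can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the code used for the experiments: the function names `discrepancy_masks` and `one_hot`, the dictionary keys, and the array layout (experts first, classes last) are chosen only for this example, and the 2 x 2 label maps are made up so that each case occurs exactly once.

```python
import numpy as np

def discrepancy_masks(expert_logits, moe_logits):
    """Boolean discrepancy masks for the perfect, normal, and critical cases.

    expert_logits: array of shape (n, H, W, m) with the logits of n experts.
    moe_logits:    array of shape (H, W, m) with the logits of the MoE.
    Returns a dict that maps a case name to a boolean (H, W) mask.
    """
    expert_labels = np.argmax(expert_logits, axis=-1)   # (n, H, W)
    moe_labels = np.argmax(moe_logits, axis=-1)         # (H, W)
    agree = expert_labels == moe_labels[None]           # (n, H, W)
    n_agree = agree.sum(axis=0)                         # agreeing experts per pixel

    masks = {
        "perfect": agree.all(axis=0),      # every expert agrees with the MoE
        "critical": ~agree.any(axis=0),    # no expert agrees with the MoE
    }
    for v in range(agree.shape[0]):
        # Normal case v + 1: expert v + 1 is the only one agreeing with the MoE.
        masks[f"normal_{v + 1}"] = agree[v] & (n_agree == 1)
    return masks

def one_hot(labels, n_classes):
    """Turn an integer label map into one-hot pseudo-logits (helper for the demo)."""
    return np.eye(n_classes)[np.asarray(labels)]

expert_logits = np.stack([
    one_hot([[0, 1], [2, 0]], 3),   # expert 1
    one_hot([[0, 2], [1, 1]], 3),   # expert 2
])
moe_logits = one_hot([[0, 1], [1, 2]], 3)

masks = discrepancy_masks(expert_logits, moe_logits)
shares = {name: 100.0 * mask.mean() for name, mask in masks.items()}
print(shares)
```

For two experts, the four masks are mutually exclusive and cover every pixel, so the per-case percentages of an image add up to 100.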


4 Experiment Setup 


To demonstrate the mixture-of-experts architecture, we chose the application of 
semantic image segmentation of traffic scenes. The experiments build upon and 
extend our previous work [PHW+20]. 


4.1 Datasets and Metrics 


Dataset: All experiments were performed using the A2D2 dataset [GKM+20], con- 
sisting of RGB images with detailed semantic segmentation masks. In addition to 
the semantic segmentation labels, we manually labeled front camera images by road 
type (highway vs. urban vs. ambiguous). The disjoint subsets of highway and urban 
images were then used to train two expert networks, one for each data subset. Even 
though the urban images outnumber those depicting highway scenes, the same num- 
ber of training (6132) and validation (876) samples are used for each expert. Addi- 
tionally, 1421 test samples of each of the three road classes (including ambiguous) 
are available. All images are resized to 640 × 480 pixels for training and inference. 
In contrast to previous work [PHW+20], we use the complete label set of the A2D2 
dataset here, comprising 38 classes. 


Metrics: The distribution of labels in the A2D2 dataset is highly imbalanced. In 
particular, objects of nine out of 38 classes occur in less than 5% of images. To account 
for this imbalance, in addition to the mean intersection-over-union (mIoU), the 
frequency-weighted IoU (fwIoU) was used for evaluation purposes. In the case of 
the mIoU, the unweighted average of the class-wise IoU values is taken. To calculate 
the fwIoU, each class-wise IoU is weighted by class frequency, pre-computed as the ratio 
of pixels labeled as belonging to a certain class in the dataset. 
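
Both metrics can be derived from a confusion matrix, as the following NumPy sketch shows. The function names are hypothetical, and two details are assumptions rather than statements about our evaluation code: classes that occur neither in the labels nor in the predictions are excluded from the mean, and the class frequencies default to those of the evaluated pixels unless pre-computed frequencies are passed in.

```python
import numpy as np

def confusion_matrix(target, pred, n_classes):
    """Rows index the labeled class, columns the predicted class."""
    target = np.asarray(target, dtype=np.int64).ravel()
    pred = np.asarray(pred, dtype=np.int64).ravel()
    idx = target * n_classes + pred
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def miou_fwiou(conf, class_freq=None):
    """Return (mIoU, fwIoU) for a confusion matrix.

    class_freq: optional pre-computed class frequencies; if omitted, the
    share of labeled pixels per class in `conf` is used (an assumption).
    """
    tp = np.diag(conf).astype(float)
    labeled = conf.sum(axis=1).astype(float)      # pixels labeled as each class
    predicted = conf.sum(axis=0).astype(float)    # pixels predicted as each class
    union = labeled + predicted - tp
    present = union > 0
    iou = np.zeros_like(tp)
    iou[present] = tp[present] / union[present]
    if class_freq is None:
        class_freq = labeled / labeled.sum()
    miou = iou[present].mean()                    # unweighted class-wise average
    fwiou = float((np.asarray(class_freq) * iou).sum())
    return float(miou), fwiou

target = [[0, 0, 0, 0], [0, 0, 1, 1]]
pred = [[0, 0, 0, 0], [0, 1, 1, 0]]
conf = confusion_matrix(target, pred, n_classes=2)
miou, fwiou = miou_fwiou(conf)
print(conf, miou, fwiou)
```

In this toy example, the frequent class 0 has an IoU of 5/7 and the rare class 1 an IoU of 1/3, so the fwIoU (13/21) is higher than the mIoU (11/21). The same effect explains why the fwIoU values reported below are consistently higher than the mIoU values.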


4.2 Topologies 


Expert architectures: To lay the focus on the overall mixture-of-experts archi- 
tecture, two different expert base architectures are evaluated: the Full-Resolution 
Residual Network (FRRN) [PHML17] and DeepLabv3+ [CZP+18]. 

An FRRN comprises two interconnected streams: a pooling and a 
residual one. While the residual stream carries the information at the original image 
resolution, the pooling stream follows the encoder-decoder approach with a series of 
pooling and unpooling operations. Furthermore, the whole FRRN is constructed as 
a series of full-resolution residual units (FRRUs). Each FRRU takes information 
from both streams as input and also outputs information for both streams, thus having 
two inputs and two outputs. An FRRU contains two convolutional layers, each 
followed by a ReLU activation function. 

For the following experiments, we use a shallower version of FRRN, namely 
FRRN-A. The encoder part of the FRRN-A pooling stream consists of ten FRRU 
blocks, which are distributed into four groups. Each group contains FRRU blocks 
with the same number of channels in their convolution layers; the group of FRRU 
blocks is correspondingly denoted by the number of channels (e.g., FRRU$_{96}$ is a 
group of FRRUs containing convolutions with 96 channels). Each group is then 
followed by a pooling layer (max-pooling in our case). 

The second expert architecture, DeepLabv3+, uses a previous network version 
(DeepLabv3) as an encoder and enhances it with a decoder module. This encoder-decoder 
architecture of DeepLabv3+ suggests using the feature maps of the final 
encoder layer, namely the atrous spatial pyramid pooling (ASPP) layer, as input to 
the gate. For the experts based on DeepLabv3+, we evaluate architectures with the 
ResNet-101 backbone [HZRS16], pre-trained on ImageNet [RDS+15]. 


Fig. 4 Gate architecture (layers shown: Conv(3×3, 1) + ReLU, FC(128), FC(2)). The gate input is 
either the concatenation of feature maps $f_\ell^{\mathrm{expert}_1}$ and $f_\ell^{\mathrm{expert}_2}$ (in case of the MoE architecture from 
Fig. 2) or a single feature map $f_\ell^{\mathrm{enc}}$ from the shared encoder (in case of the MoE architecture from 
Fig. 3). Gate outputs $w^{\mathrm{expert}_1}$ and $w^{\mathrm{expert}_2}$ are the corresponding expert weights 


Gating: The gate is a core MoE module. It predicts expert probabilities, which can 
then be directly used to weight expert outputs. Figure 4 demonstrates the gate 
architecture. We consider two possibilities to incorporate a gate into the MoE architecture: 
a simple and a class-wise gate. A simple gate predicts a distribution of expert weights 
for the whole input image. The output of a simple gate is thus a list of n scalars, one 
for each expert. A class-wise gate, originally proposed in [VVDB17], is a list of 
simple gates, one for each label. Each simple gate follows the same architecture as 
shown in Fig. 4. The output of a class-wise gate is thus a tensor of dimension n × m, 
where n is the number of experts and m is the number of classes. Via multiplication 
with the two-dimensional gate outputs, the expert outputs are therefore weighted in 
a class-wise manner. Overall, a simple gate learns to predict how much each expert 
can be trusted for a specific input, whereas a class-wise gate learns to decide which 
expert is most suitable for each label given a specific input. 
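
The difference between the two gate types comes down to the shape of the predicted weights and how they are broadcast over the expert outputs. The following NumPy sketch illustrates this; the function name `weight_experts` and the tensor layout (experts first, classes last) are assumptions made for the example, and the weights are hand-picked rather than predicted by a trained gate.

```python
import numpy as np

def weight_experts(expert_logits, gate_weights):
    """Combine expert logits with simple or class-wise gate weights.

    expert_logits: array of shape (n, H, W, m).
    gate_weights:  shape (n,) for a simple gate (one scalar per expert) or
                   shape (n, m) for a class-wise gate (one scalar per expert
                   and class).
    Returns the combined logits of shape (H, W, m).
    """
    gate_weights = np.asarray(gate_weights, dtype=float)
    if gate_weights.ndim == 1:
        w = gate_weights[:, None, None, None]   # same weight for all classes
    elif gate_weights.ndim == 2:
        w = gate_weights[:, None, None, :]      # separate weight per class
    else:
        raise ValueError("gate_weights must have shape (n,) or (n, m)")
    return (w * expert_logits).sum(axis=0)

# Two experts, a single pixel, three classes.
expert_logits = np.stack([
    np.full((1, 1, 3), 1.0),   # expert 1 outputs 1.0 for every class
    np.full((1, 1, 3), 3.0),   # expert 2 outputs 3.0 for every class
])

simple = weight_experts(expert_logits, [0.25, 0.75])
class_wise = weight_experts(expert_logits, [[1.0, 0.5, 0.0],
                                            [0.0, 0.5, 1.0]])
print(simple[0, 0], class_wise[0, 0])
```

With the simple gate, every class receives the same mixture of the two experts, whereas the class-wise weights let the first class follow expert 1 and the last class follow expert 2.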

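Moreover, we evaluate the addition of a further convolutional layer $f^{\mathrm{conv}}$, followed 
by a ReLU activation function $\sigma$. $f^{\mathrm{conv}}$ is additionally inserted (see Figs. 2 and 3), such 
that the overall MoE output is then computed as follows: 

$$y^{\mathrm{MoE}} = F^{\mathrm{MoE}}(x) = \sigma\left(f^{\mathrm{conv}}\left(y^{\mathrm{expert}_1} \cdot w^{\mathrm{expert}_1} + y^{\mathrm{expert}_2} \cdot w^{\mathrm{expert}_2}\right)\right) \tag{6}$$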


4.3 Training 


Pipelines for training and evaluation are implemented in PyTorch [PGM+19]. 
NVIDIA GeForce GTX 1080 Ti graphics cards are used for training and inference. 
The MoE with separate feature extractors (as shown in Fig. 2) is trained in two 
stages. First, each expert is trained as a separate neural network on the corresponding 
data subset. Then, the expert weights are frozen and the whole MoE architecture is 
trained end-to-end. In the case of a shared encoder architecture (as shown in Fig. 3), 
the expert networks are trained jointly in the first stage, with their encoder layers 
sharing parameters. Afterwards, the weights of the shared encoder and of the expert 
decoders are frozen and the overall MoE is trained end-to-end. Each expert is trained 
for 50 epochs with a batch size of two. The MoE is trained for 20 epochs with a batch 
size of six. A smaller batch size is used for larger feature maps. SGD is used as the 
optimizer for all models, together with a polynomial learning rate decay starting from 
an initial learning rate of 0.01. 
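
For readers unfamiliar with polynomial learning rate decay, the following sketch shows its usual form. Only the initial learning rate of 0.01 and the expert training setup (6132 samples, batch size two, 50 epochs) are taken from the text; the exponent of 0.9, the decay to zero, and the per-iteration update are common choices that are assumed here and not confirmed by our setup.

```python
def poly_lr(step, total_steps, base_lr=0.01, power=0.9):
    """Polynomial decay from base_lr at step 0 to zero at total_steps.

    `power` = 0.9 is an assumed, commonly used exponent.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power

# Expert training: 6132 samples, batch size 2, 50 epochs.
# Updating the rate once per iteration is an assumption.
EXPERT_STEPS = (6132 // 2) * 50

schedule = [poly_lr(s, EXPERT_STEPS) for s in (0, EXPERT_STEPS // 2, EXPERT_STEPS)]
print(EXPERT_STEPS, schedule)
```

With `power=1.0`, the schedule reduces to a linear decay; exponents below one keep the learning rate higher for longer before it drops toward zero.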


5 Experimental Results and Discussion 


In the following sections, we discuss our findings from the experiments regarding 
architectural choices, analyze the performance of the MoE model, and discuss its 
potential as an interpretability approach. 


5.1 Architectural Choices 


Architectural choices reported in this section refer to Tables 1, 2, and 3 and have been 
obtained on the test dataset. Note that a good generalization on unseen test data can 
only be ensured if the respective ablation study is repeated on a separate validation 
dataset. 


Expert feature extraction layer: For the FRRN-based architecture with separate 
encoders, which is shown in Fig. 2, we evaluate feature maps extracted from the 
max-pooling layer that follows each group of FRRU blocks in the encoder. We additionally 
evaluate the usage of input images directly as gate input as in the original MoE approach. 
Table 1 demonstrates the results for the MoE architecture with a simple gate with 
an extra convolutional layer. Using raw input data as gate input has led to better 
results only on the highway data, while yielding the worst results on the urban data. 
Using the pooling layer after the last group of FRRU blocks as input, however, achieves 
the highest mIoU values on the mixed dataset. A possible reason might be that the 
extracted high-level features from the later layers serve as a better representation of 
the more complex urban images. The feature resolution shrinks via a series of pooling 
operations in the pooling stream, so the last FRRU blocks additionally lead to 
shorter training times. Since the best results on the mixed dataset were also achieved 
by a model with the second FRRU$_{384}$ feature extraction layer, it was selected for all 
further experiments. 
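
To make the size reduction concrete, the short script below counts the elements of each candidate gate input, using the feature map sizes listed in Table 1. It is only an arithmetic illustration; whether the listed sizes refer to a single expert or to the concatenation over experts is not encoded here.

```python
from math import prod

# Gate input shapes in the row order of Table 1:
# raw input first, then the four FRRU feature extraction layers.
SHAPES = [
    (640, 480, 3),
    (240, 320, 16),
    (120, 160, 16),
    (60, 80, 16),
    (30, 40, 16),
]

counts = [prod(shape) for shape in SHAPES]

# Reduction factor between consecutive feature extraction layers.
feature_counts = counts[1:]
shrink = [a // b for a, b in zip(feature_counts, feature_counts[1:])]

for shape, count in zip(SHAPES, counts):
    print(shape, count)
print("reduction per pooling step:", shrink)
print("raw input vs. last layer:", counts[0] // counts[-1])
```

Each pooling step reduces the number of elements by a factor of four, and the feature map after the last pooling layer holds 48 times fewer values than the raw input image.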


Table 1 FRRN-based architecture: mIoU/fwIoU for different feature extraction layers. The results 
are for the simple gate architecture with the extra convolutional layer 

Layer        | Feature map size | Highway     | Ambiguous   | Urban       | All 
Input        | 640×480×3        | 0.449/0.949 | 0.396/0.939 | 0.395/0.806 | 0.412/0.890 
FRRU96       | 240×320×16       | 0.342/0.938 | 0.419/0.936 | 0.479/0.861 | 0.478/0.911 
FRRU192      | 120×160×16       | 0.342/0.938 | 0.420/0.938 | 0.480/0.861 | 0.479/0.911 
FRRU384      | 60×80×16         | 0.346/0.940 | 0.420/0.938 | 0.480/0.861 | 0.480/0.912 
2nd FRRU384  | 30×40×16         | 0.390/0.947 | 0.445/0.940 | 0.467/0.855 | 0.483/0.913 


Gate architecture: To determine the best design choices for the gate architecture and 
whether to add additional convolutional layers after the gate, we train and evaluate 
MoE models with separate encoders (as shown in Fig. 2) for different combinations 
of these features. 


Table 2 FRRN-based architecture: mIoU/fwIoU for different gate architectures. The results are for 
the second FRRU384 as a feature extraction layer 

Gate                  | Highway     | Ambiguous   | Urban       | All 
Simple                | 0.349/0.938 | 0.418/0.937 | 0.473/0.859 | 0.475/0.910 
Simple with conv.     | 0.390/0.947 | 0.445/0.940 | 0.467/0.855 | 0.483/0.913 
Class-wise            | 0.373/0.942 | 0.443/0.939 | 0.472/0.861 | 0.475/0.913 
Class-wise with conv. | 0.366/0.939 | 0.408/0.938 | 0.482/0.863 | 0.483/0.912 


Table 3 DeepLabv3+-based architecture: mIoU/fwIoU for different gate architectures 

Gate                  | Highway     | Ambiguous   | Urban       | All 
Simple                | 0.405/0.956 | 0.349/0.942 | 0.334/0.854 | 0.345/0.917 
Simple with conv.     | 0.415/0.955 | 0.339/0.942 | 0.342/0.854 | 0.358/0.917 
Class-wise            | 0.399/0.953 | 0.360/0.943 | 0.314/0.849 | 0.333/0.914 
Class-wise with conv. | 0.404/0.954 | 0.369/0.942 | 0.333/0.853 | 0.351/0.916 

The analysis of the gate’s predictions has demonstrated that the simple gate tends 
to select the correct expert for each data subset. As an example, the average expert 
weights as predicted by a simple gate for the DeepLabv3+-based architecture are 
as follows: 0.8 for the highway expert and 0.2 for the urban expert on the highway 
data subset, 0.04 for the highway expert and 0.96 for the urban expert on the urban 
data subset, and 0.52 for the highway expert and 0.48 for the urban expert on the 
ambiguous data. 

The class-wise gate, however, predicts almost uniform weights for the majority of 
classes. Only a slightly higher weight is assigned to the highway expert on the highway 
data and correspondingly to the urban expert on urban data. Classes for which the 
urban expert is assigned a significantly higher weight on all inputs are Signal 
corpus, Sidewalk, Buildings, Curbstone, Nature object, and RD 
Normal Street. Moreover, we observed that the class-wise gate tends to predict 
the same weights for all inputs. 

Overall, the simple gate adapts much more flexibly to the input data, which 
explains the better results of the corresponding architectures (see Tables 2 and 3). 
Regardless of the gate architecture, an additional convolutional layer also leads 
to higher mIoU values on the mixed dataset, because the overall model has a higher 
capacity. Both FRRN- and DeepLabv3+-based MoEs demonstrate the best performance 
for the simple gate with an additional convolutional layer after the gate. 


5.2 MoE Performance 


To compare the performance of the proposed MoE architecture with a baseline and 
with further aggregation methods, we focus in the following experiments on the 
MoE with the separate feature encoders (as shown in Fig. 2), which uses the simple 
gate architecture, an additional convolutional layer and, in case of the FRRN-based 
model, the second FRRU$_{384}$ as its feature extraction layer. Our conclusions on the 
performance of this model are also valid for almost all further architectural choices 
presented in the previous subsection (cf. Tables 1, 2, and 3). Results discussed in this 
section refer to Tables 4 and 5 and have been obtained on the test dataset. 


MoE versus baseline versus experts: To ensure a proper basis for comparison, we 
train a baseline on a combined dataset of highway and urban data. Its architecture is 
identical to that of an individual expert. Since the baseline was exposed to twice as 
much data as each expert, it outperforms the expert models that are trained on their 
respective data subset only. As expected, each of the experts demonstrates the best 
results on its corresponding data subset, whereas the urban expert shows a slightly 
better generalization due to a higher diversity of traffic scenes in its training data. The 
performance of the DeepLabv3+-based MoE with the separate feature encoders 
(as shown in Fig. 2) surpasses that of the baseline on the mixed dataset, whereas the 
FRRN-based MoE approaches the baseline. 


Table 4 FRRN-based architecture: mIoU/fwIoU for experts and the MoE. For the MoE, the results 
are shown for a simple gate with convolutional layer and second FRRU384 as its feature extraction 
layer 

Model \ Dataset         | Highway     | Ambiguous   | Urban       | All 
Baseline                | 0.425/0.946 | 0.415/0.940 | 0.458/0.859 | 0.475/0.914 
Highway expert          | 0.498/0.952 | 0.298/0.919 | 0.149/0.616 | 0.227/0.820 
Urban expert            | 0.323/0.934 | 0.407/0.936 | 0.476/0.859 | 0.472/0.908 
MoE                     | 0.390/0.947 | 0.445/0.940 | 0.467/0.855 | 0.483/0.913 
Ensemble (mean)         | 0.453/0.949 | 0.399/0.938 | 0.382/0.819 | 0.417/0.900 
Ensemble (max)          | 0.436/0.951 | 0.410/0.935 | 0.390/0.823 | 0.439/0.900 
Highway expert (shared) | 0.494/0.954 | 0.359/0.939 | 0.272/0.786 | 0.227/0.820 
Urban expert (shared)   | 0.380/0.944 | 0.431/0.940 | 0.466/0.859 | 0.472/0.908 
MoE (shared)            | 0.386/0.952 | 0.410/0.893 | 0.438/0.810 | 0.462/0.873 


Table 5 DeepLabv3+-based architecture: mIoU/fwIoU for experts and the MoE. For the MoE, 
the results refer to the simple gate with convolutional layer 

Model \ Dataset         | Highway     | Ambiguous   | Urban       | All 
Baseline                | 0.421/0.956 | 0.364/0.944 | 0.302/0.841 | 0.321/0.913 
Highway expert          | 0.405/0.955 | 0.252/0.928 | 0.098/0.607 | 0.154/0.821 
Urban expert            | 0.334/0.942 | 0.353/0.941 | 0.334/0.854 | 0.333/0.912 
MoE                     | 0.415/0.955 | 0.339/0.942 | 0.342/0.854 | 0.358/0.917 
Ensemble (mean)         | 0.372/0.953 | 0.320/0.942 | 0.235/0.802 | 0.266/0.897 
Ensemble (max)          | 0.401/0.954 | 0.362/0.943 | 0.291/0.830 | 0.317/0.908 
Highway expert (shared) | 0.400/0.955 | 0.284/0.941 | 0.173/0.766 | 0.214/0.884 
Urban expert (shared)   | 0.396/0.952 | 0.357/0.944 | 0.294/0.835 | 0.311/0.910 
MoE (shared)            | 0.401/0.952 | 0.293/0.941 | 0.252/0.813 | 0.267/0.911 


MoE versus ensemble: We also evaluated an ensemble of experts as a competing 
aggregation approach. For this, the class-wise probabilities predicted by the 
pre-trained experts were combined by taking either the mean or the maximum over 
the experts. Both combination strategies consistently underperform compared to an 
MoE with separate feature encoders on all data subsets, except for the FRRN-based 
architecture on highway data. Moreover, of those two approaches, the combination 
via maximum has led to slightly better results. 
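
The two ensembling rules can be sketched as follows. The function name `ensemble_predict` and the tensor layout (experts first, classes last) are assumptions for this example, and the probabilities are made up to show that the two rules can select different classes for the same pixel.

```python
import numpy as np

def ensemble_predict(expert_probs, rule="mean"):
    """Combine class-wise probabilities of several experts with a fixed rule.

    expert_probs: array of shape (n, H, W, m) with per-expert probabilities.
    rule: "mean" averages over the experts, "max" takes the element-wise maximum.
    Returns the predicted label map of shape (H, W).
    """
    expert_probs = np.asarray(expert_probs, dtype=float)
    if rule == "mean":
        combined = expert_probs.mean(axis=0)
    elif rule == "max":
        combined = expert_probs.max(axis=0)
    else:
        raise ValueError("rule must be 'mean' or 'max'")
    return combined.argmax(axis=-1)

# Two experts, a single pixel, three classes.
expert_probs = np.array([
    [[[0.50, 0.45, 0.05]]],   # expert 1
    [[[0.00, 0.45, 0.55]]],   # expert 2
])

mean_label = ensemble_predict(expert_probs, "mean")
max_label = ensemble_predict(expert_probs, "max")
print(mean_label, max_label)
```

Here, the mean rule selects the class on which both experts moderately agree, while the max rule follows the single most confident expert. In both cases the rule is fixed for all inputs, unlike the gate of an MoE.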


MoE with a shared encoder: For the shared encoder architecture as shown in 
Fig. 3, the differences in performance of experts are less drastic, whereas the MoE 
demonstrates a slightly inferior mIoU when compared to an MoE with separate 
encoders. It seems as if the insufficient specialization of the experts negatively affects 
the performance of the MoE model. Although the shared encoder provides faster 
inference, a further drawback of this approach is that joint training of both shared 
experts is much more time-consuming. This might limit the usage of the architecture, 
especially when a fast replacement of certain experts during inference is considered. 


5.3 Disagreement Analysis 


We compare pixel-wise predictions of each expert and of the overall MoE architecture. 
We report the percentage of pixels belonging to each disagreement class in Tables 6 
and 7, using the test dataset. In both architectures, the experts and the overall MoE 
output the same predictions (perfect case) for the majority of pixels. Pixels belonging 
to the normal and critical cases take up to 5% of an image for highway and ambiguous 
data. For urban data, the MoE tends to rely heavily on the urban expert (up to 
28% of pixels), which is also more accurate on this data according to the 
reported mIoU and fwIoU. 

Table 6 FRRN-based architecture: percentage of pixels belonging to each disagreement case. The 
results refer to the architecture with an additional convolutional layer after the gate and second 
FRRU384 as a feature extraction layer 

Gate          | Simple                        | Class-wise 
Case          | Highway | Ambiguous | Urban   | Highway | Ambiguous | Urban 
Perfect case  | 97.00   | 95.84     | 70.76   | 96.93   | 95.84     | 70.74 
Normal case 1 | 0.63    | 0.40      | 0.55    | 0.59    | 0.40      | 0.64 
Normal case 2 | 2.14    | 3.51      | 27.87   | 2.18    | 3.51      | 27.70 
Critical case | 0.22    | 0.25      | 0.82    | 0.30    | 0.25      | 0.92 


Table 7 DeepLabv3+-based architecture: percentage of pixels belonging to each disagreement 
case. The results refer to the architecture with an additional convolutional layer after the gate 

Gate          | Simple                        | Class-wise 
Case          | Highway | Ambiguous | Urban   | Highway | Ambiguous | Urban 
Perfect case  | 97.53   | 96.67     | 72.94   | 97.52   | 96.54     | 72.92 
Normal case 1 | 2.02    | 1.34      | 0.39    | 1.62    | 1.17      | 1.12 
Normal case 2 | 0.28    | 1.78      | 25.98   | 0.61    | 1.91      | 24.94 
Critical case | 0.17    | 0.21      | 0.69    | 0.25    | 0.37      | 1.02 


To facilitate visual analysis of the disagreement regions, we show a disagreement 
mask for each input (see Fig. 5). The perfect case pixels are left transparent, and the 
normal case pixels are colored green and blue for agreement with the highway and 
urban expert, respectively. The critical case pixels are highlighted in red. The visual analysis 
is consistent with the semantics of the objects in the scene. Regions mapped to the 
normal case are those that can confidently be segmented by the corresponding 
expert, because they mostly occur in its training set and not in the training set of a 
different expert. 

The critical cases are small image areas (up to 1% of pixels). Because we consider 
ambiguous traffic scenes as out-of-distribution in our setting, they provide the most 
interesting material for the visual analysis of the critical cases (see Fig. 6). The critical 
case areas are usually the overexposed, blurred, ambiguous, or hardly visible regions 
of an image. Interestingly, sidewalk pixels are often classified as belonging to the 
critical case in the ambiguous dataset. 

Fig. 5 Disagreement masks and predictions for highway, urban, and ambiguous scenes (columns: 
ground truth, MoE prediction, disagreement mask). Perfect case pixels are shown as image pixels, normal 
case pixels are highlighted green (agreement with the highway expert) or blue (agreement with the 
urban expert), and critical case pixels are highlighted red. The results are for the DeepLabv3+ 
architecture with a simple gate and additional convolutional layer 


Conclusions 


Mixture of experts (MoE) is a network aggregation approach, which combines sim- 
plicity and intrinsic interpretability of ensembling methods with the possibility to 
construct more flexible models via the application of an additional gate module. In 
this chapter, we have studied how a mixture of experts can be used to aggregate 
deep neural networks for semantic segmentation to increase performance and gain 
additional interpretability. 

Fig. 6 Examples of disagreement masks with critical case from the ambiguous test set (columns: 
ground truth, MoE prediction, disagreement mask). Perfect case pixels are shown as image pixels, 
normal case pixels are highlighted green (agreement with the highway expert) or blue (agreement 
with the urban expert), and critical case pixels are highlighted red. The results are for the 
DeepLabv3+ architecture with a simple gate and additional convolutional layer 

Our experiments with two different expert architectures demonstrate that MoE is 
able to reach baseline performance and additionally reveal image regions for which 
the model exhibits high uncertainty. In comparison to our previous experiments 
in [PHW+20], the models were trained on the full A2D2 label set, which led not only 
to decreased performance on rare classes, as expected, but also to an increased occurrence 
of disagreements between the overall architecture and the experts. Furthermore, 
we have evaluated the possibility to share the parameters of the feature extraction 
layers of both experts. This leads to better cross-subset expert performance. However, 
the experts with a shared encoder are apparently no longer specialized enough, 
which leads to a worse MoE performance when compared to an MoE with standalone 
experts. Our evaluation has also shown that a mixture of experts, enhanced with a 
gating network, beats a simple combination of experts via ensembling. 

Further research directions might include conditional execution within an MoE
model, combination of various modalities as inputs, as well as the perspective to 
further enhance the interpretability and robustness of the overall model. 


1 Introduction


The objective of systems safety engineering is to avoid unreasonable risk of hazards 
that could lead to harm to the users of the system or its environment. The definition 
of an unreasonable level of risk must be determined for the specific system context in 
accordance with societal moral concepts and legal considerations regarding product 
liability. This requires a careful evaluation of the causes and impact of system failures, 
an evaluation of the probability of their occurrence, and the definition of strategies to 
eliminate or reduce the residual risk associated with such failures. Current automotive 
safety standards, such as ISO 26262 [ISO18] and ISO/DIS 21448 [ISO21], require
a safety case to be developed for safety-related functions that forms a structured 
argument supported by systematically collected evidence that the appropriate level 
of residual risk has been achieved. The term “evidence” refers to work products 
created during the development, test, and operation of the system that support the 
claim that the system meets its safety goals. 

In comparison to previous generations of vehicle systems, automated driving sys- 
tems (ADS) that make use of machine learning (ML) introduce significant new chal- 
lenges to the safety assurance process. Deep neural networks (DNNs), in particular, 
are seen as an enabling technology for ADS perception functions due to their ability to 
distinguish features within complex, unstructured data. This allows for the development
of functions, such as pedestrian detection within crowded urban environments,
that previously could not be specified and implemented based on algorithmic definitions.
Paradoxically, this leads to one of the main challenges associated with the use of
machine learning in safety-critical systems. The semantic gap [BHL+20] describes 
the challenge of deriving complete and consistent technical (safety) requirements 
on the system that fulfil potentially only implicitly understood, societal and legal 
expectations. The semantic gap is exacerbated in ML-based ADS due to the unpre- 
dictability and complexity of the operational domain and the reliance on properties of 
the training data rather than detailed specifications to derive an adequate approxima- 
tion of the target function. The semantic gap can in turn lead to an unclear definition 
of the moral responsibility and legal liability for the system’s actions as well as an 
incomplete assurance argument that the (potentially incompletely defined) safety 
goals for the system are met. 

Furthermore, deep learning approaches have specific properties that limit the 
effectiveness of established safety measures for software. These include the inherent 
uncertainty in the outputs of the ML function, the opaque manner in which features are
learnt by the function, which is often not understandable by humans, and the difficulty
of extrapolating from test results due to non-linear behaviour of the function and
sensitivity to small changes in the input domain. Previous work related to the safety 
assurance of machine learning has mainly focused on the structure of the assurance 
case and associated processes with respect to existing safety standards [BGH17, 
SQC17, GHP+20, ACP21, BKS+21]. Other work has focused on the effectiveness 
of specific metrics and measures on providing meaningful statements related to safety 
properties of the ML function [CNH+18, HSRW20, SKR+21, CKL21]. This chapter 
complements this work by examining in more detail the role of combining various
types of evidence to form a convincing argument that quantitative acceptance criteria 
as required by standards such as ISO 21448 are met. 

This chapter discusses the challenge of deriving suitable acceptance criteria for 
ML functions and describes how supporting evidence can be combined to provide 
a convincing safety assurance case for the system. The aggregation of evidence 
can be structured as an overall evidence-based safety argumentation as depicted in 
[MSB+21]. In the following section, we describe the challenge of deriving a set of 
technical safety requirements and associated acceptance criteria on ML functions. 
In Sect. 3, we provide a categorisation of safety evidence for machine learning and
argue that an assurance case requires an appropriate combination of complementary
evidence. Section 4 describes a collaborative approach between domain and safety
experts for collecting and evaluating the safety evidence. Sections 5.1-5.4 provide 
specific examples of each category of evidence and describe the conditions under 
which they can provide a meaningful contribution to the assurance case. Section 6 
then demonstrates how the evidence can be combined into an overall assurance case 
structure. The chapter concludes with a summary of open research questions and an 
outlook for future work. 


2 Deriving Acceptance Criteria for ML Functions 


This section presents approaches to risk evaluation within the safety standards and 
discusses how safety acceptance criteria for ML functions can be derived in line with 
these approaches. 


2.1 Definition of Risk According to Current Safety Standards 


Current standards related to the safety of vehicle systems provide little specific guid- 
ance related to the use of ML. Therefore, the approach described in this chapter will 
be based upon an interpretation and transfer of the principles of ISO 26262, ISO/DIS 
21448, and ISO/TR 4804 to the specific task of safety assurance for machine learning- 
based systems. ISO 26262 defines functional safety as the absence of unreasonable 
risk due to hazards caused by malfunctioning behaviour of electrical/electronic sys- 
tems. Malfunctioning behaviour is typically interpreted as either random hardware 
faults or systematic errors in the system, hardware, or software design. ISO 26262 
applies a predominantly qualitative approach to arguing the safety of software. Safety 
goals for a system are derived according to a hazard and risk analysis that evaluates 
the risk associated with each hazard according to qualitative criteria with various cat- 
egories for the risk parameters: severity, exposure, and controllability. The standard 
provides a method for combining these parameters to derive an overall “Automotive
Safety Integrity Level” (ASIL) within a range of A to D in order of increasing
risk. Functions with no safety impact are assigned the level “QM” (only standard
quality management approaches are necessary). An ASIL is allocated to each safety 
goal derived from the hazards and to the functional and technical safety requirements 
derived from these until eventually software-level requirements, including associated 
ASILs, are determined. The standard then provides guidance on which measures to 
apply, depending on ASIL, to achieve a tolerable level of residual risk. By doing 
so, the standard avoids the need to define specific failure rates to be achieved by 
software but instead defines desirable properties (such as freedom from run-time 
errors caused by pointer mismatches, division-by-zero, etc.) and methods suited to 
ensuring or evaluating these properties (e.g. static analysis). 
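
The combination of severity, exposure, and controllability into an ASIL can be
illustrated with a short sketch. It assumes the class ranges S1 to S3, E1 to E4, and
C1 to C3 and the commonly used additive reading of the ISO 26262 risk graph; neither
the class ranges nor this shortcut is stated in this chapter, and the function name
`determine_asil` is hypothetical. Classes S0, E0, and C0 are not handled.

```python
def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Map S (1-3), E (1-4), and C (1-3) classes to "QM" or "ASIL A".."ASIL D".

    Assumption (not taken from this chapter): the additive shortcut for the
    risk graph, where S + E + C = 10 yields ASIL D, 9 yields C, 8 yields B,
    7 yields A, and anything lower yields QM.
    """
    if not (1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3):
        raise ValueError("expected S in 1..3, E in 1..4, C in 1..3")
    total = severity + exposure + controllability
    return {10: "ASIL D", 9: "ASIL C", 8: "ASIL B", 7: "ASIL A"}.get(total, "QM")


if __name__ == "__main__":
    # Highest severity, exposure, and (lack of) controllability.
    print(determine_asil(3, 4, 3))
    # Same hazard, but easily controllable by the driver.
    print(determine_asil(3, 4, 1))
```

The sketch only shows that the classification is qualitative: no failure rate
appears anywhere in the mapping, which matches the approach of the standard
described above.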

ISO/DIS 21448 addresses safety in terms of the absence of unreasonable risk due 
to functional insufficiencies of the system or by reasonably foreseeable misuse. In 
the context of ADS, this translates to a possible inability of a function to correctly 
comprehend the situation and operate safely, e.g. due to a lack of robustness regarding 
input variations or diverse environmental conditions. The risk model described by 
ISO/DIS 21448 can be summarised as identifying as many known, unsafe scenarios 
(where performance insufficiencies could lead to hazards) as possible so that miti- 
gating measures can be applied, thus transforming them to known, safe scenarios. In 
addition, the number of residual unknown, unsafe scenarios should be minimised by 
increasing the level of understanding of properties of the environment that could trig- 
ger performance deficiencies. The definition of safety of the intended functionality 
(SOTIF), therefore, seems well suited for arguing the performance of the trained ML 
function over all possible scenarios within the operating domain. ISO/DIS 21448 
follows a similar process to identify hazards, associated safety goals, and require- 
ments as ISO 26262. However, instead of using risk categories according to ASILs, 
the standard requires the definition of quantitative acceptance criteria (also known 
as validation targets) for each hazard, which in turn can be allocated to subsystems 
such as perception functions. However, these acceptance criteria are not described in 
more detail and must, therefore, be defined according to the specific system context. 

ISO/TR 4804 contains a set of guidelines for achieving the safety of automated 
driving systems with a focus on the Society of Automotive Engineers (SAE) levels 3 to 5
[SAE18]. ISO/TR 4804 defines safety acceptance criteria both in terms of a positive
risk balance (the system shall be demonstrably safer than an average human driver) 
as well as the avoidance of unreasonable risk. In doing so, the standard recommends 
a combination of both qualitative and quantitative arguments. A similar philosophy 
is followed within this chapter when identifying and combining evidence for the 
safety of a machine learning function. 


2.2 Definition of Safety Acceptance Criteria for the ML Function


In order to address the semantic gap described above, a systematic method of deriv- 
ing a set of technical safety requirements on an ML function is needed. This should 
include an iterative and data-driven approach to analyse the societal and legal expec- 
tations, operating domain, and required performance on the ML function within the 
technical system context (see Fig. 1). In this chapter, we focus on steps 4 and 5, in 
particular, the systematic collection of evidence regarding the performance of an ML 
function to support a system-level assurance case. 

In [BGH17], the authors recommend a contract-based design approach to derive 
specific safety requirements for the machine learning function. These requirements 
are expressed in the form of a set of safety guarantees that the function must ful- 
fil and a set of assumptions which can be made in the system’s environmental and 
technical context. This allows for a compositional approach to reasoning about the 
safety of the system where the guarantees of one function can be used to justify the 
assumptions on the inputs of another. This requires a suitable definition of safety 
at the level of abstraction of the ML function. For example, a superficially defined 
requirement, such as “detect a pedestrian on the road”, would need to be refined 
to define which characteristics constitute a pedestrian and from which distance and 
level of occlusion pedestrians should be detected. This process of requirements elic- 
itation should include the consideration of a number of stakeholder perspectives and 
sources of data. This could include current accident statistics and the consideration 
of ethical guidelines for AI as proposed by the European Commission [Ind19]. It is 
to be expected that for any particular automated driving function, a number of such 
contracts would be derived to define different properties of the ML function related 
to various safety goals, where each contract may also be associated with different 
quantitative acceptance criteria and sets of scenarios in the operating domain. 


[Figure: V-model of the systems engineering process (Domain Analysis/Item Definition,
System Specification, Functional System Design, Technical System Architecture Design,
Hardware/Software Design, Hardware/Software Implementation, Hardware/Software Module
Tests, Hardware/Software Integration and Test, System Test, Field-Based Validation),
starting from societal, legal, and ethical expectations and annotated with the
following steps:
(1) From societal, legal and ethical expectations to explicitly formulated system
    behavioural constraints
(2) From a set of scenario classes to a formalised system-level specification of safe
    behaviour
(3) From system-level safety requirements to functional safety concept
(4) From functional safety concept to technical architecture and technical safety
    requirements on the ML function
(5) Iteration based on an understanding of the technical performance potential and
    limitations of the function]


Fig. 1 Iterative steps during the systems engineering process to bridge the semantic gap associated 
with the definition of technical safety requirements on ML functions 


Table 1 Example safety contract for pedestrian detection

Input domain   The set of all possible situations within an urban environment in which
               pedestrians could be within the range of the vehicle’s sensors

Assumptions    Safety-relevant pedestrians have a size 50 cm < size < 250 cm and are
               within a range of 3 m < range < 50 m

Guarantees     Bounding box accuracy per frame in the sequence shall be within 20 cm of
               the ground truth, and the pedestrian must be detected whenever the
               occlusion through other objects is < 50%. The confidence score for
               successful classification per frame shall be > 70%


Table 1 contains an example of a (not necessarily complete) safety contract for a 
bounding-box pedestrian detection model. This contract is defined within the scope 
of a system architecture where the quality of the camera signal acting as input to the 
function is either known or can be determined during development and where a sensor 
fusion and monitoring component receives the result of the ML-based detection 
algorithm, performs time-series plausibility checks on the results, and compares them
with other sensor modalities such as RaDAR and LiDAR.
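
To make the assumption/guarantee structure of such a contract concrete, the
following minimal sketch encodes the values of Table 1 as executable checks. The
class and function names are hypothetical, and the reading of the guarantees (a
missed detection is tolerated only at 50% occlusion or more; a detection must meet
both the 20 cm and the 70% bounds) is one possible interpretation rather than a
definition taken from the contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PedestrianFrame:
    """Per-frame ground truth and detector output for one pedestrian (hypothetical)."""
    size_cm: float
    range_m: float
    occlusion: float                       # occluded fraction in [0, 1]
    detected: bool
    bbox_error_cm: Optional[float] = None  # deviation from the ground-truth box
    confidence: Optional[float] = None     # classification confidence in [0, 1]


def assumptions_hold(f: PedestrianFrame) -> bool:
    # Assumptions of Table 1: 50 cm < size < 250 cm and 3 m < range < 50 m.
    return 50 < f.size_cm < 250 and 3 < f.range_m < 50


def guarantees_hold(f: PedestrianFrame) -> bool:
    # Interpretation: a miss is acceptable only if occlusion is 50% or more.
    if not f.detected:
        return f.occlusion >= 0.5
    # A detection must be within 20 cm and have a confidence above 70%.
    return (
        f.bbox_error_cm is not None and f.bbox_error_cm <= 20
        and f.confidence is not None and f.confidence > 0.7
    )


def contract_violated(f: PedestrianFrame) -> bool:
    # The contract says nothing about frames outside its assumptions.
    return assumptions_hold(f) and not guarantees_hold(f)
```

A frame outside the assumptions (for example, a pedestrian at 80 m) never counts as
a violation in this sketch, which reflects the compositional idea that the
guarantees are only owed when the assumptions are met.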

A qualitative approach to defining the assurance targets for the ML function could 
involve determining a suitable combination of development and test methods to be 
applied that are assumed to lead to a correct implementation of the function. However, 
due to typical failure modes and performance limitations of ML, an absolute level 
of correctness in the function is infeasible. Instead, quantitative assurance targets, 
in line with the requirements of ISO/DIS 21448, are required that would define 
an acceptable limit to the probability that guarantees cannot be met. For example, 
an acceptance criterion for pedestrian detection based on the remaining accuracy 
rate (RAR) metric [HSRW20] could be formulated as “An RAR of 95% is achieved 
and residual errors are distributed equally across all scenarios”, leading to the 
probability of a single pedestrian being undetected by both the ML function and the 
sensor fusion/monitor component being sufficiently low. The remaining accuracy 
rate is defined in [HSRW20] as the proportion of inputs where a confidence score 
above a certain threshold (e.g. 70%) leads to a true positive detection. Thus, the
parameters of the quantitative acceptance targets must be defined based on a detailed
understanding of the performance limits of the ML function and the capabilities of
the sensor fusion/monitoring component.
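
As an illustration of the metric described above, the following sketch computes the
remaining accuracy rate from per-input confidence scores and correctness flags,
together with the complementary remaining error rate. It follows the verbal
definition given in this section; the function names, the input format, and the
example values are assumptions made for illustration and are not taken from
[HSRW20].

```python
from typing import Sequence


def _check(confidences: Sequence[float], correct: Sequence[bool]) -> None:
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one input is required")


def remaining_accuracy_rate(
    confidences: Sequence[float], correct: Sequence[bool], threshold: float = 0.7
) -> float:
    """Share of all inputs that are both confident (> threshold) and correct."""
    _check(confidences, correct)
    hits = sum(1 for c, ok in zip(confidences, correct) if c > threshold and ok)
    return hits / len(confidences)


def remaining_error_rate(
    confidences: Sequence[float], correct: Sequence[bool], threshold: float = 0.7
) -> float:
    """Share of all inputs that are confident (> threshold) but wrong."""
    _check(confidences, correct)
    misses = sum(1 for c, ok in zip(confidences, correct) if c > threshold and not ok)
    return misses / len(confidences)


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.6, 0.95]          # illustrative values only
    ok = [True, False, True, True]
    print(remaining_accuracy_rate(conf, ok))  # 2 of 4 inputs: 0.5
    print(remaining_error_rate(conf, ok))     # 1 of 4 inputs: 0.25
```

Note that the correct but low-confidence input (0.6) counts towards neither rate,
which is why the threshold itself has to be justified as part of the acceptance
criterion.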


3 Understanding the Contribution of Safety Evidence 


3.1 A Causal Model of Machine Learning Insufficiencies 


Fig. 2 Causal model of ML-related SOTIF risks demonstrating the relationships between
system failures, machine learning insufficiencies, their causes, and exacerbating
factors as well as the role of control measures to minimise the probability of
hazardous system failures


Ideally, to demonstrate that the ML function meets its performance requirements, the
probability of failure will be directly measured, for example, based on a sufficiently
large number of representative tests. However, this approach is feasible only for trivial
examples. Therefore, multiple sources of evidence are required to reason about the
presence of performance limitations and the ability of the system to detect and control
such failures during operation.
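
A rough calculation indicates why direct measurement scales so poorly. The sketch
below assumes independent, representative tests and a simple binomial model, which
are idealisations not claimed by this chapter: if n tests are all passed, the claim
"failure probability at most p" is supported with confidence c once
(1 - p)^n <= 1 - c. The function name and the numeric targets are illustrative.

```python
import math


def required_failure_free_tests(max_failure_prob: float, confidence: float) -> int:
    """Smallest n such that n failure-free tests support the claim
    'failure probability <= max_failure_prob' at the given confidence.

    Assumes independent, representative tests: (1 - p)^n <= 1 - confidence.
    """
    if not (0.0 < max_failure_prob < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    # log1p keeps numerical precision for very small failure probabilities.
    n = math.log(1.0 - confidence) / math.log1p(-max_failure_prob)
    return math.ceil(n)


if __name__ == "__main__":
    print(required_failure_free_tests(0.05, 0.95))  # 59
    print(required_failure_free_tests(1e-7, 0.95))  # on the order of 3e7
```

Under these assumptions, a hypothetical target of 1e-7 per input already requires
tens of millions of failure-free, representative test cases, and any observed
failure increases the number further.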

The dependability model of Avizienis et al. [ALRL04] forms the foundation of
many safety analysis techniques. In this model, the risk of a system violating its 
objectives is determined by analysing the propagation path from causes of individual 
faults in the system that could lead to an erroneous system state which in turn leads 
to a failure to maintain the system’s objectives. Risk can thus be controlled by either 
eliminating the causes of faults or by preventing them from leading to potentially 
hazardous erroneous states of the system. 

Figure 2 summarises how this approach to causal analysis can be applied to the 
problem of determining the safety-related performance of an ML function and is 
inspired by the safer complex systems framework presented in [BMGW21]. A sys- 
tem failure can be defined as a condition where the safety contract as defined in 
Table 1 is not met. An erroneous state leading to such a failure would relate to the 
presence of insufficiencies in the machine learning function which, in turn, could 
be caused by either faults in its execution or limitations in the training data. Exam- 
ples for the latter are scalable oversight [AOS+16], uncertainty in the input domain, 
such as distributional shift [AOS+16], or the inherent inability of the ML approach 
to accurately represent the target function. Measures to improve the safety of the 
function can be categorised into those applied during design time to reduce the prob- 
ability of insufficiencies in the trained model and those applied during operation in 
order to reduce the impact of residual insufficiencies. Both types of controls may 
be undermined by exacerbating factors specific to the system context, for example, 
the difficulty in collecting balanced and sufficiently complete training data due to 
the scarcity of critical scenarios or the difficulty of developing effective monitoring
approaches due to the need for safe-operational fallback concepts. This model
of causality for ML-related safety of the intended functionality (SOTIF) risks is
now used to derive the categories of safety evidence that are described in the next
subsection.


3.2 Categories of Safety Evidence 


Based on the causal model of SOTIF-related risk introduced above, the following four
categories of evidence can now be defined. These categories correspond to
development work products and results of verification and validation activities that
support the argument of an acceptably low level of residual risk associated with ML
insufficiencies:

(1) Confirmation of residual failure rates: This category of evidence provides a 
direct estimate of the performance of the machine-learning component, e.g. in terms 
of false negative rates, or the intersection-over-union (IoU) related to bounding box 
accuracy. However, due to non-linear behaviour in the function and the limited num- 
ber of samples that can be realistically tested, such evidence is unlikely to provide a 
statistically relevant estimation. This category of evidence must, therefore, be used in 
combination with other measures that increase confidence in the ability to extrapolate 
the results to the entire input space that fulfils the contract’s assumptions. 

(2) Evaluation of insufficiencies: This category of evidence is used to directly 
indicate the presence or absence of specific ML insufficiencies that could not only 
lead to a violation of the safety contract, but could also undermine confidence in 
the category 1 evidence. The properties that would be evaluated by this category of
evidence would include prediction uncertainty, generalisation, brittleness, fairness, 
and explainability. 

(3) Evaluation of the effectiveness of design-time controls: This category of 
evidence is used to argue the effectiveness of design-time measures to increase the 
performance of the function or to reduce the presence of insufficiencies. In many 
cases, a direct correlation between the design-time measures and the performance of 
the function may not be measured, leading to qualitative arguments for the effective- 
ness of the measures. However, metrics can be used to measure the level of rigour 
by which the measures have been applied. These could include properties of training 
data and test data, or measures to increase the explainability of the trained model. 

(4) Evaluation of the effectiveness of operation-time controls: This category of 
evidence is used to demonstrate the effectiveness of operation-time measures to either 
increase the performance of the function or to reduce the impact of insufficiencies. 
Examples of such could be a measurement of the effectiveness of out-of-distribution 
detection to identify inputs that are outside of the training distribution (and beyond 
the assumptions of the safety contract) such that they can be discarded. 

In the following sections, we illustrate each category with specific examples and 
evaluate the conditions under which the evidence can provide a significant contribu- 
tion to the assurance case. 


4 Evidence Workstreams: Empowering Experts from Safety Engineering and ML to Produce
Measures and Evidence


In order to identify, develop, and evaluate effective evidence related to the vari- 
ous categories defined above, we propose to follow the procedure of so-called evi- 
dence workstreams: experts from machine learning, safety, testing, and data working 
together according to the procedure depicted in Fig. 3. The main purpose of the pro- 
cess is to help demonstrate the effectiveness of the methods, metrics, and measures 
to be applied more generally. Related processes are also found in current literature. 
Picardi et al. [PPH+20] propose an engineering process to generate evidence at each 
stage of the ML life cycle. Apart from the process in general, different patterns are 
described that can be instantiated during the ML life cycle. Furthermore, a three-layer 
process model is described by McDermid and Jia [MJ21], featuring a collaborative 
model to bring together the expertise from the ML and safety domain to generate 
evidence for the assurance case. That work includes some case studies, but detailed
steps for the process are missing.

The evidence workstreams proposed in this chapter enable a joint cooperation of 
different competencies and bring together new innovative methods and structured 
approaches rather than having separate work on each topic. An important key to the 
successful creation of an assurance case is the continuous interaction of the different 
contributors. Before explaining the procedure, we, therefore, define four roles for the 
contributors: 


• The method developer is responsible for implementing measures that mitigate
  specific insufficiencies of an ML component. In addition, this role specifies data
  requirements for the implemented measures and defines metrics for evaluating the
  effectiveness of the measure.

• Apart from the method developer specifying data requirements, a data engineer
  is responsible for creating and managing the dataset throughout the development
  process as well as performing data analyses to estimate the data coverage.

• In order to test the effectiveness of the measures, the test engineer develops test
  specifications including required test data and metrics. These tests should then
  be performed by the method developer to demonstrate the effectiveness of the
  method.

• Finally, the safety engineer ensures that the developed measures and respective
  test results are valid and generate the evidence for the assurance case.


Fig. 3 Flowchart of the evidence workstream process


The process starts by creating and maintaining a catalogue of potential design-time 
and run-time measures to mitigate identified insufficiencies, which can be known 
a priori (see Sect. 5.2) or identified during the development phase. Here, method
developers and safety engineers are involved in the maintenance and also perform 
a prioritisation of measures for each insufficiency. After the selection of measures, 
the test engineer designs and implements effectiveness tests. Furthermore, safety- 
aware metrics for the test are defined by the method developer and safety engineer, 
in addition to other criteria for the evaluation. Importantly, the data engineer needs 
to evaluate the database in order to allow for a statistical significance analysis for the 
designed tests. Afterwards, the prioritised measures are implemented, followed by an
evaluation and comparison according to the specified tests. Once the measures show
sufficient effectiveness, they can be added to the system development and the safety 
engineer can use the test results as evidence for the assurance case. If a measure was 
not effective, then we propose the two following options: 


• Firstly, measures can be combined and optimised to increase the effectiveness.
  After combination, a re-evaluation is performed.

• Secondly, new measures can be picked from the catalogue and the test design and
  implementation step is repeated.


Additionally, an iteration of the safety requirement derivation and system archi- 
tecture design process could lead to recommendations for additional components or 
adjusted specifications in order to mitigate remaining insufficiencies. 
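
The selection loop of the workstream can be summarised in a short sketch. It is a
simplification under stated assumptions: the catalogue is already ordered by
priority, the effectiveness test is represented by a single callable that returns a
score, only pairwise combinations are tried, and all names (including the example
measures) are hypothetical.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence, Tuple

Measures = Tuple[str, ...]


def select_measures(
    catalogue: Sequence[str],
    effectiveness: Callable[[Measures], float],
    target: float,
) -> Optional[Measures]:
    """Return the first measure (or pair of measures) whose tested
    effectiveness reaches the target, or None if nothing suffices.

    `catalogue` is assumed to be sorted by priority; `effectiveness`
    stands in for the test specification of the test engineer.
    """
    # Evaluate prioritised measures one by one (picking the next measure
    # from the catalogue corresponds to the second option in the text).
    for measure in catalogue:
        if effectiveness((measure,)) >= target:
            return (measure,)
    # Combine measures and re-evaluate (first option in the text).
    for pair in combinations(catalogue, 2):
        if effectiveness(pair) >= target:
            return pair
    # Nothing was effective: requirements or architecture need another iteration.
    return None


if __name__ == "__main__":
    scores = {  # illustrative effectiveness values, not measured results
        ("augmentation",): 0.6,
        ("ood_detection",): 0.7,
        ("augmentation", "ood_detection"): 0.9,
    }
    chosen = select_measures(["augmentation", "ood_detection"], scores.__getitem__, 0.8)
    print(chosen)  # ('augmentation', 'ood_detection')
```

A return value of `None` corresponds to the case in which an iteration of the
safety requirement derivation or the system architecture design is needed.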


5 Examples for Evidence 


This section discusses examples for the different categories of safety evidence intro- 
duced in Sect. 3.2, which can be obtained by the process introduced in Sect. 4. 


5.1 Evidence from Confirmation of Residual Failure Rates


The quantification of model performance or residual failure rates of a machine 
learning (ML) component is a crucial step in any ML-based system, especially for 
safety-critical applications. Typically, the performance of ML components is mea- 
sured by metrics that calculate a mean performance over a specific test dataset. 
In [PNdS20] and [PPD+21], commonly used metrics for object detection are com- 
pared, including mean average precision (mAP) and mean intersection-over-union 
(mIoU). However, for safety-critical systems, only evaluating the mean performance
is not sufficient. On the one hand, there are unknown risks introduced by residual
failure rates; on the other hand, failures are not weighted by their relevance. To
counteract this, safety-aware metrics have been proposed in the literature. For
example, the metrics in [HSRW20] incorporate uncertainty to evaluate the remaining
accuracy rate and remaining error rate. In [CKL21], safety-aware metrics for semantic
segmentation
are proposed, putting emphasis on perturbation and relevance. Volk et al. [VGvBB20] 
propose a safety metric that incorporates object velocity, orientation, distance, size, 
and a damage potential. Furthermore, ISO/TR 4804 [ISO20] proposes a generic con- 
cept, whereby safety-aware metrics need to be evaluated for perception components 
realised with deep neural networks. In addition, it should be noted that the valid- 
ity of metrics depends strongly on the utilised dataset. In general, the creation of 
safety-aware metrics is still an open field in the domain of automated driving and for 
perception components in particular. Based on the references and insights above, the 
following considerations should be taken into account to use performance metrics 
as evidence: 


• Mean performance metrics should be evaluated within a specified input space.

• Statistical significance of the metric values should be evaluated and shown by
  measuring dataset coverage.

• The failure rate related to specific classes of errors should be evaluated, including
  an analysis of their causes.


The evidence for performance or accuracy evaluation could be in the form of a test 
report, where the specification of the input space and the data coverage evaluation 
is defined. Apart from detection or classification accuracy, localisation precision is 
also an important factor and is often dependent on the accuracy. In addition to a 
summary of the safety-aware metrics, an analysis of various classes of errors, their 
potential causes and impact within the system should be included in the test report, 
allowing for a systematic evaluation of the residual risk associated with the function 
at a system level. 

The following example provides an intuition of conditions that have to be met so 
that a performance metric can be used as evidence convincingly. The basis of nearly 
every performance metric is the calculation of the confusion matrix, containing the
numbers of true positives (TP), false positives (FP), false negatives (FN), and true
negatives (TN). Here, we consider the case of 2D bounding box pedestrian
detection on images for automated driving. In this case, counting TNs is not
meaningful, since there is a nearly unlimited number of possible bounding boxes
that contain no pedestrian. In
order to compute the confusion matrix, the intersection-over-union (IoU) metric is 
typically used to determine if a detection can be considered as a TP, FP, or FN by 
using a threshold. This threshold, on the one hand, defines the localisation error per 
object which must not be exceeded for the detection to be counted as TP, and, on the 
other hand, also influences the mean accuracy on a test dataset. Therefore, in the test 
report, there should be an argument about why a certain threshold has been chosen. 
Furthermore, typically FP and FN are weighted equally, but for a safety analysis, 
FN might be more important and not all FN might be relevant. Hence, the test report 
should define relevant FN or FP based on the system architecture and assigned 
component requirements by introducing specific weights. Of course, performance 
metrics alone cannot give enough evidence to argue the safety of a perception system 
for automated driving. In addition, the performance needs to be evaluated in case of 
rarely occurring situations, and also the robustness against perturbations needs to be 
considered. 
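
The role of the IoU threshold and of relevance weights can be made tangible with a
minimal sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2), a simple
greedy one-to-one matching, and per-object relevance weights supplied by the user;
these choices, the function names, and the example boxes are illustrative and do
not reproduce a specific benchmark protocol.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(
    ground_truth: Sequence[Box], predictions: Sequence[Box], iou_threshold: float = 0.5
) -> Tuple[int, int, List[int]]:
    """Greedy matching; returns (TP, FP, indices of missed ground-truth boxes)."""
    unmatched = list(range(len(ground_truth)))
    tp = fp = 0
    for pred in predictions:
        best, best_iou = None, 0.0
        for g in unmatched:
            value = iou(pred, ground_truth[g])
            if value > best_iou:
                best, best_iou = g, value
        if best is not None and best_iou >= iou_threshold:
            tp += 1
            unmatched.remove(best)  # each ground-truth box is matched at most once
        else:
            fp += 1
    return tp, fp, unmatched


def weighted_false_negatives(missed: Sequence[int], weights: Sequence[float]) -> float:
    """Sum of relevance weights of the missed objects (weights are user-defined)."""
    return sum(weights[i] for i in missed)


if __name__ == "__main__":
    gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
    preds = [(1, 1, 10, 10), (50, 50, 60, 60)]
    print(match_detections(gt, preds, 0.5))  # (1, 1, [1])
    print(match_detections(gt, preds, 0.9))  # (0, 2, [0, 1])
```

The same predictions yield one TP at a threshold of 0.5 and none at 0.9, which
illustrates why the test report should justify the chosen threshold; the weights
allow a missed pedestrian close to the vehicle to count more than a distant one.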


5.2 Evidence from Evaluation of Insufficiencies 


An essential feature in the development of deep neural networks during training 
lies in the purely data-driven parameter fitting process without expert intervention: 
The deviation of the output (for a given parameterisation) of a neural network from 
a ground truth is measured. The loss function is chosen in such a way that it
depends on the parameters in a differentiable way. As part of the gradient descent
algorithm, the parameters of the network are adjusted in each training step depending 
on the derivative of the deviation (backpropagation). These training steps are repeated 
until some stopping criterion is satisfied. In this procedure, the model parameters are 
determined without an expert assessment or semantically motivated modelling. This 
has significant consequences for the properties of the neural network: 


• Deep neural networks (DNNs) are largely opaque for humans and their calculations
cannot be interpreted. This represents a massive limitation for systematic testing
or formal verification.

• Deep neural networks are susceptible to harmful interference: Perturbations could
be manually induced changes in the data (adversarial examples) or real-world
corruptions (e.g. sensor noise, weather influences, certain colours, or contrasts
caused by sensor degeneration).

• It is unclear which input characteristics an algorithm becomes sensitive to. The
execution of neural networks in another domain (training in summer, execution in
winter, etc.) sometimes reduces the functional quality dramatically.


This leads to DNN-specific insufficiencies and safety concerns as described in detail
by Willers et al. [WSRA20], Sämann et al. [SSH20], and Schwalbe et al. [SKS+20].
In our evidence strategy, we utilise metrics to evaluate the insufficiencies and use
specific values and justified bounds as specific evidence. In the following, this
procedure is briefly explained for the example of brittleness of DNNs. The strategies
for the other insufficiencies follow basically the same pattern, but the elaboration
would go beyond the scope of this chapter.


Fig. 4 Visualisation of the robustness of a semantic segmentation model: Rain augmentation via
Hendrycks' augmentations [HD19] on Cityscapes [COR+16] (left, "Augmented Image"), baseline
performance on the corrupted image (middle, "Baseline Segmentation") and performance of a
robustified model via AugMix with Jensen-Shannon divergence (JSD) loss [HMC+20] (right,
"Defended Segmentation")


Measuring robustness of deep neural networks (DNNs): Regarding brittleness,
the specific evidence that we want to define consists of test results achieving the
required performance even under reasonable perturbations. If the required performance
is achieved for all required conditions, we call the DNN “robust”. To evaluate the
robustness with respect to real-world corruptions, we suggest using both recorded
and tagged real-world effects and augmented perturbations such as noise, blur,
colour shift, etc., via Hendrycks' augmentations [HD19] as depicted in Fig. 4. The
required parameters for the test conditions are extracted from operational design
domain (ODD), sensor, and data analysis.
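The robustness criterion above (required performance achieved for all required conditions) can be sketched as a small test harness. The two perturbation functions below are simplified stand-ins chosen for illustration and are not the corruptions of [HD19]; the severity scaling, the toy "model", and the required accuracy are likewise assumptions.

```python
import random


def gaussian_noise(pixels, severity, rng):
    """Additive Gaussian noise; sigma grows linearly with severity (assumed scaling)."""
    sigma = 0.05 * severity
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def brightness_shift(pixels, severity, rng):
    """Uniform brightness offset; `rng` is unused but kept for a common signature."""
    return [min(1.0, max(0.0, p + 0.1 * severity)) for p in pixels]


def accuracy(model, samples):
    return sum(model(x) == y for x, y in samples) / len(samples)


def robustness_report(model, samples, perturbations, severities, required, seed=0):
    """Evaluate the model for every (perturbation, severity) test condition.

    Returns the per-condition accuracy and whether the required accuracy
    is reached under all conditions.
    """
    rng = random.Random(seed)
    report = {}
    for name, perturb in perturbations.items():
        for severity in severities:
            perturbed = [(perturb(x, severity, rng), y) for x, y in samples]
            report[(name, severity)] = accuracy(model, perturbed)
    return report, all(value >= required for value in report.values())


def toy_model(pixels):
    """Hypothetical classifier: 1 if the image is bright on average, else 0."""
    return int(sum(pixels) / len(pixels) > 0.5)


samples = [([0.2] * 4, 0), ([0.8] * 4, 1)]
report, robust = robustness_report(
    toy_model, samples, {"brightness": brightness_shift},
    severities=[1, 2, 4], required=0.9)
print(report, robust)
```

In practice the list of perturbations and the severity ranges would be derived from the ODD, sensor, and data analysis mentioned above, and `accuracy` would be replaced by the performance metric defined in the component requirements.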


5.3 Evidence from Design-Time Controls 


Design-time controls systematically complement and integrate evidence gained via 
overall performance considerations or regarding specific insufficiencies. The major 
goal is to integrate the various aspects and metrics that have to be considered into 
a holistic view, allowing the AI developer, firstly to incorporate measures that are 
overall effective, such as architectural choices, and parameter optimisations, or data 
coverage measures, and secondly, to handle potential trade-offs between different 
optimisation targets and requirements. It is important to note that the design-time 
controls are to be applied in an iterative workflow during development, aiming to 
provide sufficient evidence for the overall safety argumentation. As a specific exam- 
ple of a design-time control, we sketch how developers and safety engineers can be 
supported by a visual analytics tool during the design and development phase. 


Understanding DNN predictions via visual analytics: An important contribution
to a complete safety argumentation can be made via methods that support humans
in understanding and analysing why an AI system comes to a specific decision.
Insights into the inner operations of a DNN can increase trust in the entire
application of that neural network. However, in a safety argumentation, it is not
sufficient to “explain” single decisions of a network, for example, in some kind of
“post analysis” in case of an accident that might have been caused by a failure of a
DNN-based pedestrian detection. Instead, it should be possible to provide arguments
about the understanding of the inner operations of a neural network considering a huge
set of input test data at design time. The obvious solution for achieving this would be
to replace the DNN completely with an interpretable model. However, this is often not
feasible considering the trade-off with overall performance of the network. Ideally,
such a “design-time understanding” of the overall behaviour of a DNN would result in an
understanding of the network performance over a complete landscape spanned by
semantic and non-semantic dimensions of the test data as illustrated in Fig. 5.


Fig. 5 Landscape of DNN performance over semantic and non-semantic dimensions (legend: high
metric values, low metric values)

Figure 5 shows an idealised view of the performance of the DNN in a selected
relevant metric (see explanations in Sect. 5.1) per input test data point over selected
semantic and non-semantic dimensions. In the case of pedestrian detection in the
automotive context, semantic dimensions may refer to the semantics of a scene
description in the operational design domain, such as pedestrian attributes or
environmental conditions, while non-semantic dimensions may refer to technical image
effects, such as blurring or low contrast, which are supposed to have an influence
on the performance of the DNN. Such a view can support the human in identifying
systematic weaknesses of a DNN in terms of human-understandable dimensions. To
be able to create a view as depicted in Fig. 5, a human user, whether in the role of a
DNN developer, safety engineer, or auditor, is faced with two challenges:


• The sheer amount of test data needed to reach significant insights usually exceeds
the cognitive capacity of the human brain.

• Finding and selecting the relevant dimensions that actually influence the network
performance among all possible dimensions is far from being trivial, especially
taking into account that the performance of a DNN mostly depends on a combination
of different dimensions and usually cannot be explained by considering one
dimension alone.
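A coarse numerical counterpart of the landscape in Fig. 5 can be obtained by binning the per-sample metric over one semantic and one non-semantic dimension. The following is a minimal sketch under assumptions: the metadata keys (`weather`, `blur`, `iou`), the bin width, and the weakness threshold are hypothetical and only serve as an illustration.

```python
from collections import defaultdict
from statistics import mean


def performance_landscape(records, semantic_key, non_semantic_key, metric_key, bin_width):
    """Mean metric and sample count per (semantic value, non-semantic bin) cell."""
    cells = defaultdict(list)
    for record in records:
        bin_index = int(record[non_semantic_key] // bin_width)
        cells[(record[semantic_key], bin_index)].append(record[metric_key])
    return {cell: (mean(values), len(values)) for cell, values in cells.items()}


def weak_cells(landscape, min_metric, min_count=1):
    """Cells with enough samples whose mean metric falls below `min_metric`."""
    return sorted(cell for cell, (value, count) in landscape.items()
                  if count >= min_count and value < min_metric)


# Hypothetical per-image metadata: weather tag, blur level, and IoU.
records = [
    {"weather": "sunny", "blur": 0.1, "iou": 0.75},
    {"weather": "sunny", "blur": 0.2, "iou": 0.5},
    {"weather": "rain", "blur": 0.7, "iou": 0.25},
    {"weather": "rain", "blur": 0.1, "iou": 0.75},
]
landscape = performance_landscape(records, "weather", "blur", "iou", bin_width=0.5)
print(weak_cells(landscape, min_metric=0.5))
```

Reporting the sample count next to the mean helps to separate genuinely weak regions from cells that are simply under-sampled, which relates to the first challenge listed above.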


In order to overcome these two challenges, tool support is needed. In [SMHA21], 
an approach is described to support the human user in finding and evaluating machine 
learning models by means of visual analytics. In this approach, tool support is pro- 
vided to enable the user to perform visual analyses over huge amounts of data in such 
a way that a quick and intuitive understanding of the nature of the underlying data and 
network performance can be gained. Such an initial understanding usually enables 
the user to create hypotheses about the relevance of semantic and non-semantic 
dimensions, which then can be checked or refined interactively. In this iterative
process, human knowledge about scene semantics helps to build a semantic
understanding of the DNN.

Figures 6 and 7 show snapshots of the visual analytics tool being used to investigate
the performance of a DNN for pedestrian detection based on a semantic segmentation
of images. The so-called metadata, which indicates the values regarding a
specific dimension of an input image (such as number of pedestrians, brightness of
the image, and overall detection performance in the image) and the values regarding
specific pedestrians in the image (such as size, position in the image, detection
performance for this pedestrian), is listed in a tabular interface as illustrated in Fig. 6.
The user can issue arbitrary queries against this table to select interesting subsets of
the data. The plots shown in Fig. 7 give an additional overview of various attributes.
Selected images, predictions, or ground truth can be visualised as shown in Fig. 9.

Figure 8 shows an exemplary scatterplot of the pixel size of a pedestrian in an
image versus the detection performance measured in IoU (intersection-over-union)
for this pedestrian. The goal is to find out whether certain aspects, such as pixel size
of the pedestrian in the image, have an effect on the detection performance of the
underlying DNN. It can be seen that there is a general tendency for pedestrians with
larger height (more to the left) to achieve higher (better) detection performance.
However, there exist some large pedestrians with relatively low detection performance
(as highlighted by the selection box in Fig. 8). The visual analytics tool allows for
interactively selecting such sets of interesting points from a plot for further analysis
just by drawing a selection box around the points of interest. Selected images will
then be visualised in the drill down view of the tool shown in Fig. 9. In this particular
example, some of the selected images showed “pedestrians” riding bicycles
as depicted in Fig. 9b. The human analyst could now use the visual analytics tool to
further investigate the hypothesis that cyclists are not detected well enough by the
DNN.
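The selection-box step and the subsequent hypothesis check can be mimicked on a metadata table in a few lines. This sketch does not reproduce the tool described in [SMHA21]; the record layout, the `tags` field, and the numeric ranges are hypothetical.

```python
def selection_box(records, x_key, y_key, x_range, y_range):
    """Records whose (x, y) values fall inside a rectangular selection."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    return [r for r in records
            if x_min <= r[x_key] <= x_max and y_min <= r[y_key] <= y_max]


def tag_share(records, tag):
    """Fraction of records carrying a given (hypothetical) metadata tag."""
    return sum(tag in r["tags"] for r in records) / len(records) if records else 0.0


# Hypothetical per-pedestrian metadata rows.
pedestrians = [
    {"height_px": 200, "iou": 0.85, "tags": []},
    {"height_px": 180, "iou": 0.20, "tags": ["bicycle"]},
    {"height_px": 220, "iou": 0.15, "tags": ["bicycle"]},
    {"height_px": 40, "iou": 0.10, "tags": []},
    {"height_px": 160, "iou": 0.25, "tags": ["occluded"]},
]

# Large pedestrians with low detection performance.
critical = selection_box(pedestrians, "height_px", "iou", (150, 400), (0.0, 0.3))
print(len(critical), tag_share(critical, "bicycle"), tag_share(pedestrians, "bicycle"))
```

If a tag is clearly over-represented in the selection compared with the whole dataset, as "bicycle" is in this toy table, the corresponding hypothesis is worth investigating on more data.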


Fig. 6 Visual analytic tool for analysing DNN performance and weaknesses: tabular query view


Fig. 7 Visual analytic tool for analysing DNN performance and weaknesses: plots of attributes


Fig. 8 Plot of the pedestrian detection performance (IoU per pedestrian) in dependence of the
height of the pedestrian in the image in pixels


(a) Overlay of input image and ground truth (b) Overlay of ground truth (left) and predictions
(right): selected critical images often contain cyclists


Fig. 9 Visual analytic tool for analysing DNN performance and weaknesses: drill down view
showing (a) a sample image, and (b) critical images selected from the correlation plot shown in
Fig. 8 (Images: BIT Technology Solutions)


5.4 Evidence from Operation-Time Controls 


Operation-time controls are typically used to mitigate unacceptable risk occurring 
during run-time of the system. For example, when the performance of a percep- 
tion component for automated driving degenerates, controls could become active 
to mitigate the degeneration by handing vehicle controls back to a human driver or 
by using alternative perception components. The operation-time controls are par- 
ticularly important for automated driving, since such systems operate in the real 
world. As not all possible situations can be foreseen at design time,
architectural solutions are required.

In this section, we give two examples of how evidence from operation-time con- 
trols can be derived. The first example deals with out-of-distribution detection and 
the second one goes into detail for uncertainty estimation. 


Out-of-distribution detection: One of the major causes of ML insufficiencies is 
the fact that the data distribution of an operational domain is unknown. Therefore, 
it is only possible to sample from this unknown distribution and approximate this 
distribution by statistical approaches or machine learning (ML) techniques. During 
the construction of the safety contract, a certain input domain is defined at a semantic 
level, which directly leads to a gap to the real data distribution of the operational 
domain. One of the resulting ML insufficiencies is that there is an unknown behaviour 
when the ML function is presented with samples that are not within the training data 
distribution. Therefore, a requirement could be that leaving the operational design
domain (ODD) has to be detected at run-time. Such a requirement could introduce
a monitoring component using complementary methods. Three example methods
could be:


• When the ODD defines certain areas of operation, e.g. urban intersections,
localisation information in combination with map data can be utilised to detect
leaving the ODD.

• Some sensors can directly measure an environmental state, such as a rain sensor,
which can be used to set bounds according to the ODD to detect whether the ODD
is left.

• Out-of-distribution methods [HG17, XYA20] can be used to detect a distributional
shift of the input samples during run-time with respect to the database used for
training an ML model for perception or planning.


For the last method, there could furthermore be two types of evidence. Firstly, the
effectiveness of the methods should be shown by a test report including an evaluation
of a metric that indicates the separation precision between in- and out-of-distribution
samples. Secondly, it has to be shown that the training and test data used to train the
out-of-distribution detector are sufficient.
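One way to quantify the separation between in- and out-of-distribution samples is the area under the ROC curve (AUROC) of an out-of-distribution score. The sketch below uses one minus the maximum softmax probability as the score, which is a commonly used baseline; it is an illustrative choice here and not necessarily the method of the cited works. The logit values are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ood_score(logits):
    """Higher means more likely out-of-distribution (1 - max softmax probability)."""
    return 1.0 - max(softmax(logits))


def auroc(in_scores, out_scores):
    """Probability that a random OOD sample scores higher than a random
    in-distribution sample; ties count as one half."""
    wins = 0.0
    for out in out_scores:
        for inn in in_scores:
            wins += 1.0 if out > inn else 0.5 if out == inn else 0.0
    return wins / (len(in_scores) * len(out_scores))


# Made-up classifier logits for in-distribution and OOD inputs.
in_logits = [[5.0, 0.0, 0.0], [4.0, 0.0, 1.0]]
out_logits = [[1.0, 1.0, 1.0], [0.5, 0.0, 0.2]]
separation = auroc([ood_score(z) for z in in_logits],
                   [ood_score(z) for z in out_logits])
print(separation)
```

An AUROC of 0.5 corresponds to a detector that cannot separate the two sets, while 1.0 means perfect separation on the given test data. Whether that test data is sufficient is the second type of evidence mentioned above.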


Uncertainty estimation: Uncertainty estimation methods can also be applied as 
operation-time controls. Uncertainty is an inherent property of any machine learning 
model, which may result from either aleatoric sources, such as a non-deterministic 
behaviour of the real world or issues in the data or labelling, or epistemic sources, 
referring to the inherent limitations of the machine learning model itself. Uncertainty 
estimation methods aim to enable the DNN to indicate the uncertainty related to a 
specific DNN prediction given a specific input at run-time. This may also contribute 
to out-of-distribution detection under the assumption that the uncertainty increases 
when the DNN is applied outside of its training data distribution. This assumption 
of course must be rigorously tested in order to provide the corresponding evidence. 
Among popular uncertainty estimation methods, methods based on Monte-Carlo
(MC) dropout [GG16] receive particular attention in embedded applications as they
approximate the performance of Bayesian neural networks and hence usually outperform
techniques purely based on post-processing calibration, and come with an acceptable
run-time overhead (compared to, e.g. full Bayesian neural networks or deep ensembles).
Although originally intended for capturing epistemic uncertainty only, MC dropout can
be extended to capture aleatoric uncertainty as well [KG17, SAP+21].
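The basic mechanics of MC dropout can be illustrated with a tiny hand-written network: dropout stays active at inference time, several stochastic forward passes are averaged, and the entropy of the averaged prediction serves as an uncertainty indicator. The network size, weights, dropout rate, and number of passes below are arbitrary assumptions; a real perception DNN would use a deep learning framework instead.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w_hidden, w_out, drop_p, rng):
    """One stochastic forward pass: ReLU hidden layer with (inverted) dropout."""
    hidden = []
    for row in w_hidden:
        activation = max(0.0, sum(w * xi for w, xi in zip(row, x)))
        keep = rng.random() >= drop_p
        hidden.append(activation / (1.0 - drop_p) if keep else 0.0)
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(logits)


def mc_dropout_predict(x, w_hidden, w_out, drop_p=0.5, passes=50, seed=0):
    """Mean class probabilities and predictive entropy over several passes.

    Requires drop_p < 1.
    """
    rng = random.Random(seed)
    samples = [forward(x, w_hidden, w_out, drop_p, rng) for _ in range(passes)]
    mean_probs = [sum(s[k] for s in samples) / passes for k in range(len(w_out))]
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, entropy


# Arbitrary illustrative weights: 2 inputs, 3 hidden units, 2 classes.
w_hidden = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
w_out = [[1.0, -0.5, 0.2], [-0.4, 0.6, 0.9]]
probs, entropy = mc_dropout_predict([1.0, 2.0], w_hidden, w_out)
print(probs, entropy)
```

The number of passes is the main driver of the run-time overhead mentioned above, since each pass is a full forward evaluation of the network.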


6 Combining Safety Evidence in the Assurance Case 


While the insights sketched above surely help to understand weaknesses and improve
the development of the DNN under investigation, the step towards arguing evidence in
an assurance case still has to be performed. According to the principles of ISO 26262,
ISO/DIS 21448, and ISO/TR 4804, the assurance case shall state in a convincing
way: “The system is safe because...”. However, the assurance of DNNs leads to
several problems, since this technology requires new paradigms in development.
The software is no longer explicitly developed. Instead, the neural network is trained
and the network’s behaviour is implicitly influenced by the training models and data.
The combination of safety evidence in the assurance case provides central elements
for a holistic assurance strategy.


Fig. 10 Assurance case development: Integration of evidence into an assurance case (diagram
elements: safety goal; safety requirements; DNN-related safety concerns; evidence strategy;
metrics, methods; measures (DNN, architecture, tooling, testing, other); evidence; safety
argumentation in goal structuring notation (GSN) with context, goal, assumption, strategy,
justification, evidence)

The core aspect of safety argumentation is to show that the mitigation of insuf- 
ficiencies was successful. If the insufficiency is reduced to an acceptable level, this 
provides evidence to be used in the safety argumentation. As shown in Fig. 10, one 
possibility to combine safety evidence in the assurance case is to start on the top 
level with the definition of safety goals. These are goals that define mandatory steps 
to avoid hazards. Then the safety requirements are refined step-by-step based on 
the described causal model of SOTIF-related risk using the categories of evidence. 
This is supported by considering DNN-related safety concerns. Moreover, several 
metrics are defined to show the effectiveness of measures that mitigate the effects 
of insufficiencies. The goal structuring notation (GSN) can be used to assemble evi- 
dence collected from various methods as they were presented in Sects. 5.1, 5.2, 5.3, 
and 5.4 to provide a structured overall safety argumentation. The GSN visualises
the elements of the safety argumentation. An assurance case can be presented in
a clear and structured way in the GSN. A GSN tree consists of three central ele- 
ments: the argumentation goal, the description of the argumentation strategy, and 
the evidence. These three elements are supported by assumptions, justifications, and 
context information. A central aspect is the iterative nature of this technique to refine 
understanding of insufficiencies in the function. Further iterations are started on the 
top level (definition of safety goals). 
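The three central GSN elements (goal, strategy, evidence) and their supporting context can be represented as a small tree, which also makes the iterative refinement explicit: after each iteration, the goals that are still open can be listed. The classes and the support rule below are a simplified sketch under assumptions, not an implementation of the GSN standard, and the example claims are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    description: str
    accepted: bool


@dataclass
class Strategy:
    description: str
    subgoals: List["Goal"] = field(default_factory=list)


@dataclass
class Goal:
    claim: str
    context: List[str] = field(default_factory=list)
    assumptions: List[str] = field(default_factory=list)
    strategies: List[Strategy] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)

    def _subgoals(self) -> List["Goal"]:
        return [g for s in self.strategies for g in s.subgoals]

    def is_supported(self) -> bool:
        """Simplified rule: all direct evidence is accepted and every strategy
        has at least one sub-goal, all of which are supported."""
        if not self.strategies and not self.evidence:
            return False  # undeveloped goal
        evidence_ok = all(e.accepted for e in self.evidence)
        strategies_ok = all(
            s.subgoals and all(g.is_supported() for g in s.subgoals)
            for s in self.strategies)
        return evidence_ok and strategies_ok

    def open_goals(self) -> List[str]:
        """Claims that still lack accepted evidence or further development."""
        subgoals = self._subgoals()
        developed = bool(self.evidence or subgoals)
        own_ok = developed and all(e.accepted for e in self.evidence)
        own = [] if own_ok else [self.claim]
        return own + [claim for g in subgoals for claim in g.open_goals()]


robust = Goal("Performance is maintained under reasonable perturbations",
              evidence=[Evidence("Robustness test report", accepted=True)])
ood = Goal("Leaving the ODD is detected at run-time",
           evidence=[Evidence("OOD detector test report", accepted=False)])
top = Goal("Pedestrian detection is acceptably safe within the ODD",
           context=["ODD definition"],
           strategies=[Strategy("Argue over mitigation of DNN insufficiencies",
                                [robust, ood])])
print(top.is_supported(), top.open_goals())
```

Listing the open goals after each iteration gives a concrete starting point for the next pass, which begins again at the top level.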


7 Conclusions 


Machine learning is an enabler technology for automated driving, especially for, 
but not limited to, the required perception components. However, assuring safety 
for such components is a challenge as standard safety-argumentation concepts are 
not sufficient to capture the inherent complexity and data-dependency of functions 
based on machine learning. For example, a component realised with an ML function 
is formally hard to describe, which in turn makes it especially difficult to define 
appropriate evidence for the safety argumentation. Therefore, we introduce different 
types of evidence as a structuring guidance: evidence from confirmation of residual 
failure rates, evidence from evaluation of insufficiencies, evidence from design-time 
controls, and evidence from operation-time controls. 

In order to create appropriate evidence, a process is required to ensure that
knowledge from different domains (machine learning, safety, and testing) is brought
together. Therefore, we propose the process of evidence workstreams to define
dence in a structured way. Furthermore, we show how to integrate evidence into an 
assurance case. 

One of the main questions still remains open: Can a convincing assurance case be 
constructed? We argue “yes” but only by explicitly acknowledging the insufficien- 
cies in the ML function within the system design, and by being able to determine the 
residual failures with sufficient confidence. This implies directly defining an accept- 
able residual risk that is acknowledged by social acceptance and legal conditions, 
which is an open challenge. 

Future research topics might be how to combine multiple quantitative and quali- 
tative pieces of evidence into the safety argumentation w.r.t. a given system architec- 
ture and how to balance them. Moreover, there is demand for derivation of evidence 
from the appropriate coverage of all possible situations within the operational design 
domain (ODD) based on a structured ground context including tests. Further, the 
iterative and dynamic process of constructing the assurance case (or “continuous 
assurance”) requires work on formal models of the assurance case and on the
continuous evaluation of the assurance case.


Oliver Grau, Korbinian Hagn, and Qutub Syed Sha


Abstract This chapter introduces a novel data synthesis framework for validation
of perception functions based on machine learning to ensure the safety and
functionality of these systems, specifically in the context of automated driving. The main
contributions are the introduction of a generative, parametric description of
three-dimensional scenarios in a validation parameter space, and a layered scene
generation process to reduce the computational effort. Specifically, we combine a module
for probabilistic scene generation, a variation engine for scene parameters, and a more
realistic simulation of sensor artifacts. The work demonstrates the effectiveness of the
framework for the perception of pedestrians in urban environments based on various
deep neural networks (DNNs) for semantic segmentation and object detection. Our
approach allows a systematic evaluation of a high number of different objects and,
combined with our variational approach, we can effectively simulate and test a wide
range of additional conditions such as various illuminations. We can demonstrate
that our generative approach produces a better approximation of the spatial object
distribution in real datasets, compared to hand-crafted 3D scenes.


1 Introduction 


This chapter introduces an automated data synthesis approach for the validation of
perception functions based on a generative and parameterized synthetic data
generation. We introduce a multi-stage strategy to sample the input domain of the possible
generative scenario and sensor space and discuss techniques to reduce the vast
amount of computational effort required. This concept is an extension and
generalization of our previous work on parameterization of the scene parameters of concrete
scenarios, called validation parameter space (VPS) [SGH20]. We extend this
parameterization by a probabilistic scene generator to widen the coverage of the generated
scenarios and a more realistic sensor simulation, which also allows different sensor
characteristics to be varied and simulated. This ‘deep’ synthesis concept overcomes
limitations of currently available systems (as discussed in the next section) and of
manually generated, i.e., human-operator-generated, synthetic data. We describe how our
synthetic data validation engine makes use of the parameterized, generative content to
implement a tool supporting complex and effective validation strategies.


O. Grau (✉) · K. Hagn · Q. Syed Sha
Intel Deutschland GmbH, Lilienthalstraße 15, 85579 Neubiberg, Germany
e-mail: oliver.grau@intel.com

K. Hagn
e-mail: korbinian.hagn@intel.com

Q. Syed Sha
e-mail: syed.qutub@intel.com

© The Author(s) 2022
T. Fingscheidt et al. (eds.), Deep Neural Networks and Data for Automated Driving,
https://doi.org/10.1007/978-3-031-01233-4_13

Perception is one of the hardest problems to solve in any automated system. 
Recently, great progress has been made in applying machine learning techniques
based on deep neural networks to solve perception problems. Automated vehicles (AVs)
are a recent focus as an important application of perception from cameras and other
sensors, such as LiDAR and radar [YLCT20]. Although the current main effort
is on developing the hardware and software to implement the functionality of AVs, 
it will be equally important to demonstrate that this technology is safe. Universally 
accepted methodologies for validating safety of machine learning-based systems are 
still an open research topic. 

Techniques to capture and render models of the real world have matured sig- 
nificantly over the last decades and are now able to synthesize virtual scenes in a 
visual quality that is hard to distinguish from real photographs for human observers. 
Computer-generated imagery (CGI) is increasingly popular for training and valida- 
tion of deep neural networks (DNNs) (see, e.g., [RHK17, Nik19]). Synthetic data 
can avoid privacy issues found with recordings of members of the public and can 
automatically produce ground truth data at higher quality and reliability than costly 
manually labeled data. Moreover, simulations allow synthesis of rare scene constella- 
tions helping validation of products targeting safety-critical applications, specifically 
automated driving. 

Due to the progress in visual and multi-sensor synthesis, building systems for 
validation of these complex systems in the data center becomes feasible now and 
offers more possibilities for the integration of intelligent techniques in the
engineering process of complex applications. We compare our approach with methods and
strategies targeting testing of automated driving [JWKW18].

The remainder of this chapter is structured as follows: The next section will give
an outline of related work in the field. In Sect. 3 we give an overview of our approach.
Section 4 describes an outline of our synthetic data validation engine, our
parameterization, including a realistic sensor simulation, and the effective computation of
the required variations. In Sect. 5 we present evaluation results, followed by Sect. 6
with some concluding remarks.


2 Related Work


The use of synthesized data for development and validation is an accepted
technique and has also been suggested for computer vision applications (e.g., [BB95]).
Several methodologies for verification and validation of AVs have been developed
[KP16, JWKW18, DG18] and commercial options exist.¹ These tools were
originally designed for virtual testing of automotive functions, such as braking systems,
and then extended to provide simulation and management tools for virtual test drives 
in virtual environments. They provide real-time-capable models for vehicles, roads, 
drivers, and traffic which are then being used to generate test (sensor) data as well as 
APIs for users to integrate the virtual simulation into their own validation system. 

The work presented in this chapter focuses on the validation of perception
functions, which are an essential module of automated systems. By separating
perception as a component, the validation problem can be decoupled
from the validation of the full driving stack. Moreover, this separation allows, on the
one hand, the implementation of various more specialized validation strategies and,
on the other hand, there is no need to simulate dynamic actors and the connected
problem of interrelations between them and the ego-vehicle. The full interaction of
objects is targeted by upcoming standards like OpenScenario.²

Recently, specifically in the domain of driving scenarios, game engines have been 
adopted for synthetic data generation by extraction of in-game images and labels 
from the rendering pipeline [WEG+00, RVRK16]. Another virtual simulator system, 
which gained popularity in the research community, is CARLA [DRC+17], also 
based on a commercial game engine (Unreal4 [Epi04]). Although game engines 
provide a good starting point to simulate environments, they usually only offer a 
closed rendering setup with many trade-offs balancing between real-time constraints 
and a subjectively good visual appearance to human observers. Specifically, the 
lighting computation in these rendering pipelines is limited and does not produce
physically correct imagery. Instead, game engines only deliver fixed rendering quality 
typically with 8 bit per RGB color channel and only basic shadow computation. 

In contrast, physically based rendering techniques have been applied to the
generation of data for training and validation, as in the Synscapes dataset [WU18]. For our
experimental deep synthesis work, we use the physically based open-source Blender
Cycles renderer³ in high dynamic range (HDR) resolution, which allows realistic
simulation of illumination and sensor characteristics, increasing the coverage of
our synthetic data in terms of scene situations and optical phenomena occurring in
real-world scenarios.

The effect of sensor and lens effects on perception performance has received little
study so far. In [CSVJR18, LLFW20], the authors model camera effects to
improve synthetic data for the task of bounding box detection. Metrics and parameter
estimation of the effects from real camera images are suggested by [LLFW20] and
[CSVJR19]. A sensor model including sensor noise, lens blur, and chromatic aberration
was developed based on real datasets [HG21] and integrated into our validation
framework.


¹ For example, CarMaker from IPG or PreScan from TASS International.
² https://www.asam.net/standards/detail/openscenario/.
³ https://www.blender.org/.

Looking at virtual scene content, the most recent simulation systems for validation
of a complete AD system include simulation and testing of the ego-motion of a
virtual vehicle and its behavior. The test content or scenarios used are therefore
aimed at simulating environments spanning a huge virtual space, in which a high
number of test miles (or km) is then driven virtually [MBM18,
WPC20, DG18]. Although this might be a good strategy to validate full AD stacks,
one remaining problem for validation of perception systems is the limited coverage
of data testing critical scene constellations (sometimes called ‘corner cases’) and
parameters that lead to a drop in performance of the DNN perception.

A more suitable approach is to use probabilistic grammar systems [DKF20,
WU18] to generate 3D scenarios which include a catalog of different object classes
and place them relative to each other to cover the complexity of the input domain.
In this chapter we demonstrate the effectiveness of a simple probabilistic grammar
system together with our previous scene parameter variation [SGH20] with a novel
multi-stage strategy. This approach allows systematic, structured testing of
conditions and relevant parameters for the validation of perception functions.
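The idea of a simple probabilistic grammar for scene generation can be sketched as weighted production rules that are expanded recursively into a list of objects, which are then placed at random positions. The grammar symbols, rule weights, and placement ranges below are invented for illustration and do not reproduce the grammar used in this chapter or in the cited works.

```python
import random

# Hypothetical grammar: symbol -> list of (weight, production).
# Symbols without rules are terminals (object or layout types).
GRAMMAR = {
    "scene": [(1.0, ["street", "actors"])],
    "street": [(0.6, ["two_lane_road"]), (0.4, ["intersection"])],
    "actors": [(0.5, ["pedestrian"]),
               (0.3, ["pedestrian", "actors"]),
               (0.2, ["car", "actors"])],
}


def expand(symbol, grammar, rng, depth=0, max_depth=20):
    """Recursively expand a symbol into a flat list of terminals."""
    if symbol not in grammar:
        return [symbol]
    if depth >= max_depth:
        return []  # cut off overly deep recursion
    rules = grammar[symbol]
    _, production = rng.choices(rules, weights=[w for w, _ in rules], k=1)[0]
    terminals = []
    for child in production:
        terminals.extend(expand(child, grammar, rng, depth + 1, max_depth))
    return terminals


def place(objects, rng, street_width=7.0, depth_range=(5.0, 50.0)):
    """Assign each object a lateral offset x and a distance z (in metres)."""
    return [{"type": obj,
             "x": rng.uniform(-street_width / 2, street_width / 2),
             "z": rng.uniform(*depth_range)}
            for obj in objects]


rng = random.Random(42)
layout, *actors = expand("scene", GRAMMAR, rng)
print(layout, place(actors, rng))
```

Fixing the random seed makes a generated scene reproducible, which is useful when the same scene constellation has to be rendered again under varied scene or sensor parameters.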


3 Concept and Overview 


The novelty of the framework introduced in this chapter is the combination of mod- 
ules for parameterized generation and testing of a wide range of scenarios and scene 
parameters as well as sensor parameters. It is tailored towards exploration of factors 
that (hypothetically) define and limit the performance of perception modules. 

A core design feature of the framework is the consistent parameterization of
the scene composition, scene, and sensor parameters into a validation parameter
space (VPS) as outlined in Sect. 4.2. This parameterization only considers the near
proximity of the ego-car or sensor; in other words, only the objects visible to the
sensor are generated. This allows a much more well-defined test of constellations
involving a specific number of object types, environment topology (e.g., types and
dimensions of streets), and relation of objects, usually as an implicit function of
where objects are positioned relative to each other in the scene.

This leads to a different data production and simulator pipeline than for conven- 
tional AV validation which typically provides a virtual world with a large extent to 
simulate and test the driving functions down to a physical level, inspired by real-world 
validation and test procedures [KP16, MBM18, DG18, JWKW18, WPC20]. 

Figure 1 shows the building blocks of our VALERIE system. The system runs
an expansion of the VPS specified in the ‘validation task’ description. Our current
implementation is based on a probabilistic description of how to generate the scene
and defines the parameter variations in the parameter space. In the future, the
validation task should also include a more abstract target description of the
evaluation metrics.

Fig. 1 Block diagram of the proposed validation approach

The data synthesis block consists of three sub-components: The probabilistic 
scene generator generates a scene constellation, including a street layout, and places 
three-dimensional objects from the asset database according to probabilistic place- 
ment rules laid out in the scenario preparation. The parameter variation generator 
produces variations of that scene, including sun and light settings and variations of 
the placement of objects (see Fig. 2 for some examples). The sensor & environment
simulation uses a rendering engine to compute a realistic simulation of the sensor
impressions.

Further, ground truth data is provided through the rendering process, which can be 
used for a pixel-accurate depth map (distance from camera to scene object) or meta 
data, like pixel-wise label identifiers of classes or object instances. Depending on 
the perception task, this information is specifically used for training and evaluation 
of semantic segmentation (see Sect. 4.5). 

The output of the sensor simulation is passed to the perception function under test
and the response to that data is computed. An evaluation metric specific to the
validation task is based on the perception response. The ground truth data, as
generated by the rendering process, is usually required here, e.g., to compute the
similarity to the known appearance of objects. In the experiments presented in this
chapter we used established performance metrics for DNNs, such as the mean
intersection-over-union (mIoU) metric, as introduced by [EVGW+15].

The parameterization along with the computation flow are described in detail in 
the next section. 
4 VALERIE: Computational Deep Validation


The goal of validation is usually to demonstrate that the perception function (DNN) 
is performing to a level defined by the validation goal for all cases included and 
specified in the ODD (operational design domain) [SAE18]. 

The framework presented in this contribution supports the validation of perception
with the data synthesis modules outlined above. Further, we suggest considering three
levels or layers, which are supported in our framework:


1. The scene variation generates 3D scenes using a probabilistic grammar. 

2. The scene parameter variation generates variations of scene parameters, such as 
moving objects or changing the illumination of the scene by changing the time of 
the day. 

3. The sensor variation generates sensor faults and artifacts. 


For an actual DNN validation, an engineer or team would build various scenarios
and specify variations within these scenarios. The variations can contain lists of
alternative objects (assets), different topologies and poses (including position and
orientation of objects), the extent of streets, and object and global parameters
such as the direction of the sun. Our modular multi-level approach enables strategies
to sample the input space, typically by applying scene variation; this can then be
combined with more in-depth parameter variation runs and variation of sensor
parameters, as required.

Specifically, the ability to combine either two or all three levels of our framework
allows a wider range of object and scene constellations to be covered. In particular,
with our integrated asset management approach, a validation engineer can ensure
that certain object classes, e.g., known critical ones, are included in the
validation runs, i.e., the engineer can explicitly control the coverage of these
object classes. In combination with our parameter variation sub-system, local changes
can be achieved, including the relative positioning of critical objects, the
positioning of the ego-vehicle or camera, and global scene parameters such as the
sun angle.


4.1 Scene Generator 


In computer graphics, a scene is considered as a collection $O = \{o_1, o_2, \ldots\}$
of objects $o_i$. These are usually organized in scene graphs (see, e.g., [Wer94]),
and this model is also the basis for file format specifications to exchange 3D models
and scenes, e.g., VRML⁴ or glTF.⁵

Each object in this graph can have a position, orientation, and scale in a (world)
coordinate system. These are usually combined into a transformation matrix $T$.
Several parameterizations for position and orientation are possible: for the position
usually a Cartesian 3D vector; for the orientation, notations such as Euler angles or
quaternions are common.

Objects $o_i$ are described by their geometry, e.g., as a triangular mesh, and
appearance (material).


4 ISO/IEC 14772-1:1997 and ISO/IEC 14772-2:2004 https://www.web3d.org.


5 www.khronos.org/gltf.

Table 1 Example placement description (JSON format)


{ "class_id" : "tree",
  "copies" : 5,
  "target" : { "v1" : [59.0, 20.0, 0.15], "v2" : [89.8, 72.5, 0.15] },
  "rot_range" : [0.0, 360.0],
  "asset_list" : [
    "e649326c-061e-4881-b0a8-f2ad6297124c",
    "3a742848-4166-4256-a045-c4a8f8ef94bc",
    "4e153905-0ed6-4077-be33-74ef4c7509f5" ] }

Sensors, such as a camera, can also be represented in a scene graph, as can
light sources. Both also have a position and orientation, and accordingly, the same
transformation matrices as for objects can be applied (except scaling).
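
As a rough illustration of how position, orientation, and scale can be combined into
one transformation matrix, the following sketch builds a 4x4 homogeneous matrix in
plain Python. It is not taken from the framework: the function names are hypothetical,
the rotation is restricted to a single yaw angle about the z-axis, and the scale is
assumed to be uniform.

```python
import math

def make_transform(position, yaw_deg=0.0, scale=1.0):
    """Build a 4x4 homogeneous matrix T = Translate * RotateZ * Scale (row-major)."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    x, y, z = position
    return [
        [scale * c, -scale * s, 0.0,   x],
        [scale * s,  scale * c, 0.0,   y],
        [0.0,        0.0,       scale, z],
        [0.0,        0.0,       0.0,   1.0],
    ]

def apply_transform(T, point):
    """Map a 3D point from object coordinates into world coordinates."""
    v = (point[0], point[1], point[2], 1.0)
    return tuple(sum(T[r][k] * v[k] for k in range(4)) for r in range(3))

# Hypothetical object: placed at (10, 5, 0), rotated by 90 degrees, scaled by 2.
T = make_transform(position=(10.0, 5.0, 0.0), yaw_deg=90.0, scale=2.0)
world = apply_transform(T, (1.0, 0.0, 0.0))
```

For a camera or light source, the same matrix would be used with the scale left at 1.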

The probabilistic scene generator (depicted in Fig. 1) places objects $o_i$ according
to a grammar file that specifies rules for placements and orientations in specific areas
of the 3D scene. The example JSON file in Table 1 specifies the placement of tree
objects in a rectangular area of the scene: The tree objects are randomly drawn from
a database; the field asset_list specifies a list of possible assets in universally
unique identifier (UUID) notation. The 3D models in the database are tagged with a
class_id, which specifies the type of object, e.g., humans or buildings. The class
information is used in the generation of metadata: semantic class labels and an
instance object identifier, which makes it possible to determine the originating
3D object on a pixel level.

The placement and orientation are determined by a pseudo random number generator
with a controlled seed, so that experiments can be re-run exactly if required.
The scene generator handles other constraints, such as specific target densities in
specific areas and distance ranges between objects, and it finally checks that the
placed object neither collides nor intersects with other objects in the scene.
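
The placement step can be sketched as follows. This is a minimal illustration under
stated assumptions, not the actual generator: the rule dictionary uses a hypothetical
`area` field with two rectangle corners, the asset names are placeholders, and the
collision test is reduced to a minimum distance between object centers.

```python
import math
import random

def place_objects(rule, min_distance, seed, max_tries=1000):
    """Sample positions and yaw angles for rule["copies"] objects in a rectangle."""
    rng = random.Random(seed)              # controlled seed -> repeatable scenes
    (x0, y0), (x1, y1) = rule["area"]      # hypothetical field: rectangle corners
    placed = []
    for _ in range(max_tries):
        if len(placed) == rule["copies"]:
            break
        x, y = rng.uniform(x0, x1), rng.uniform(y0, y1)
        # Simplified collision test: reject candidates that are too close.
        if all(math.hypot(x - p["x"], y - p["y"]) >= min_distance for p in placed):
            placed.append({
                "asset": rng.choice(rule["asset_list"]),
                "x": x,
                "y": y,
                "yaw_deg": rng.uniform(*rule["rot_range"]),
            })
    return placed

rule = {"class_id": "tree", "copies": 5, "area": [(0.0, 0.0), (30.0, 50.0)],
        "rot_range": [0.0, 360.0], "asset_list": ["tree-a", "tree-b", "tree-c"]}
scene_a = place_objects(rule, min_distance=2.0, seed=42)
scene_b = place_objects(rule, min_distance=2.0, seed=42)   # identical to scene_a
```

Using a dedicated `random.Random(seed)` instance rather than the module-level
generator keeps the result independent of other code that draws random numbers.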

The street base is also part of the input description for the scene generator. A street
can be varied in width, type of crossings, and textures for street and sidewalk. In the
current simplistic implementation, the street base is limited to a flat ‘lego world’,
i.e., only rectangular structures are implemented. Each call of the scene generator
generates a different randomized scene according to the rules in the generation
description. Figure 2 shows scenes generated by a number of runs of the scene generator.
Fig. 2 Example scenes with randomly selected and placed objects


4.2 Validation Parameter Space (VPS) 


Another core aspect of our validation approach is to parameterize all possible varia- 
tions of scene, sensor parameters, and states in a unified validation parameter space 
(VPS): Objects in a scene graph can be manipulated by considering their properties 
or attributes as a list of variable parameters. A qualitative overview of those param- 
eters is given in Table 2. Most attributes are of geometrical nature, but also materials 
or properties of light sources can be varied, as depicted in Fig. 3. 

In addition to static properties, a scene graph can include object properties that vary
over time. Some of them are already included in Table 2, such as the trajectories of
objects and sensors, indicated as $\mathcal{T} = (T(t))$, with discrete time instants
$t$. Computer graphics systems handle these temporal variations, also known as
animations, and in principle, any attribute can be varied over time by these systems.


Table 2 Overview of parameters to vary in a scene

Object class                     Variable parameters
Static object, e.g., buildings   Limited to position, orientation, and size
Streets, roads                   Geometry (e.g., position, size of lanes, etc.),
                                 friction (as function of weather conditions)
Vehicles                         $T_v$ = (position, orientation),
                                 trajectory $\mathcal{T}_v = (T_v(t))$
Humans (pedestrian)              $T_p$ = (position, orientation),
                                 trajectory $\mathcal{T}_p = (T_p(t))$
Environment                      Light, weather conditions
Sensors                          $T_s$ = (position, orientation),
                                 trajectory $\mathcal{T}_s = (T_s(t))$, sensor attributes

Fig. 3 Example of scene parameter variation; in this case the time of the day is varied,
causing dramatic changes in the scene illumination and, accordingly, in contrast

We introduce an important restriction in the current implementation of our validation
and simulation engine: Our animations are fixed, i.e., they do not change during
the runtime of the simulation, in order to allow deterministic and repeatable object
appearance, such as the poses of characters. This could be different, for example, when
a complete autonomous system is simulated, as the actions of the system might change
the way other scene agents react. We return to these aspects in the discussion and
outlook, including how this restriction could be mitigated.

For the use in our validation engine, as described in the next section, we augment a 
description of the scene in a scene graph (the asset) as outlined above, with an explicit 
description of those parameters which are variable in a validation run. Currently, our 
engine considers a list of numerical parameters with the following attributes: 


parameter_name, scene_graph_ref, type, minimum, maximum 


A specific example to describe variations of the x-position of a person in a 2D plane
in pseudo markup notation is


{p1, scene.person-1.pos.x, FLOAT, 0.0, 20.0}

Parameters, such as p1, are unique parameter identifiers used in the validation
engine to produce and test variations of the scene.
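
To make the five attributes concrete, the following sketch parses such a declaration
into a small record. The class and function names are hypothetical and only mirror the
attribute list given above; the real engine may represent declarations differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParameterDecl:
    name: str             # unique identifier, e.g. "p1"
    scene_graph_ref: str  # path to the attribute in the scene graph
    type: str             # e.g. "FLOAT"
    minimum: float
    maximum: float

    def contains(self, value: float) -> bool:
        """Check whether a value lies inside the declared range."""
        return self.minimum <= value <= self.maximum

def parse_declaration(text: str) -> ParameterDecl:
    """Parse '{p1, scene.person-1.pos.x, FLOAT, 0.0, 20.0}' into a ParameterDecl."""
    fields = [f.strip() for f in text.strip().strip("{}").split(",")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    name, ref, type_, lo, hi = fields
    return ParameterDecl(name, ref, type_, float(lo), float(hi))

p1 = parse_declaration("{p1, scene.person-1.pos.x, FLOAT, 0.0, 20.0}")
```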


4.3 Computation of Synthetic Data 


Synthetic data is generated using computer graphics methods. Specifically for color 
(RGB) images, there are many software systems available, both commercially and 
as open source. For our experiments in this chapter, we are using Blender, as this 
tool allows importing, editing, and rendering of 3D content, including scripting. 

The generation of synthetic data involves the following steps: First, a 3D scene 
model with a city model and pedestrians is generated using the probabilistic scene 
generator and is stored in one or more files. 

The scene files are loaded into one scene graph and objects have a unique identifier 
and can be addressed by the following naming convention: 


root_object.{subcomponent}.attribute 


For the example used in Sect. 4.2, scene.person-1.pos.x refers to a
path from the root object scene to the object person-1 and addresses the
attribute pos.x in person-1. The object names are composed of
ObjectClass-ObjectInstanceID. These conventions are used to assign
class or instance labels during ground truth generation.
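
The addressing scheme can be sketched with a nested dictionary standing in for the
scene graph. This is an assumption made for illustration only; in the actual system the
scene graph lives inside the modeling software, and the helper names below are
hypothetical.

```python
def set_attribute(scene_graph, path, value):
    """Follow 'root.object.attr.sub' through nested dicts and set the final key."""
    *parents, leaf = path.split(".")
    node = scene_graph
    for key in parents:
        node = node[key]          # raises KeyError if the path does not exist
    node[leaf] = value

def split_object_name(name):
    """Split 'person-1' into (object class, instance ID)."""
    object_class, _, instance_id = name.rpartition("-")
    return object_class, int(instance_id)

scene_graph = {"scene": {"person-1": {"pos": {"x": 0.0, "y": 0.0}}}}
set_attribute(scene_graph, "scene.person-1.pos.x", 3.5)
object_class, instance_id = split_object_name("person-1")
```

The class part of the name would drive the semantic label and the instance part the
instance label of the ground truth.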

The labels for object classes will be mapped to a convention used in annotation
formats (e.g., as used in the Cityscapes dataset [COR+16]) for training and
evaluation of the perception function. The 2D image of a scene is computed along
with the ground truth extracted from the modeling software rendering engine.

Using a second parameter for pos.y, in addition to the pos.x parameter in the example
in Sect. 4.2, would allow the positioning of the person in a plane spanned by the
x- and y-axes of the coordinate system defined by the scene graph.


4.4 Sensor Simulation 


We implemented a sensor model with the function blocks described in the chapter
‘Optimized Data Synthesis for DNN Training and Validation by Sensor Artifact
Simulation’ [HG22], Sect. 3.1 ‘Sensor Simulation’, and depicted in Fig. 2 of that
chapter. The module expects images in linear RGB space and floating point resolution
as provided by state-of-the-art rendering software.

We simulate a camera error model by applying sensor noise, as additive Gaussian
noise (with zero mean and freely selectable variance), and an automatic,
histogram-based exposure control (linear tone-mapping), followed by non-linear gamma
correction. Further, we simulate the following lens artifacts: chromatic aberration
and blur. Figure 4 shows a comparison of the standard tone-mapped 8-bit RGB output of
Blender (left) with our sensor simulation (right). The parameters were adapted to match
the camera characteristics of Cityscapes images. The images not only look more
realistic to the human eye, they also narrow the domain gap between synthetic and real
data (for details see the chapter ‘Optimized Data Synthesis for DNN
Training and Validation by Sensor Artifact Simulation’ [HG22]).


6 www.blender.org.


A Variational Deep Synthesis Approach for Perception Validation 369


Fig. 4 Realistic sensor effect simulation: Standard Blender tone-mapped output (left), and the
sensor simulation output (right)
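
The noise, exposure, and gamma stages can be sketched on a flat list of linear
intensities as follows. This is a simplified stand-in, not the sensor model of [HG22]:
the percentile-based white point, the gamma value of 2.2, and the function name are
assumptions, and the lens artifacts (chromatic aberration and blur) are omitted.

```python
import random

def simulate_sensor(linear_pixels, noise_variance, gamma=2.2, percentile=0.99, seed=0):
    """Map linear floating-point intensities to 8-bit values with a simple error model."""
    rng = random.Random(seed)
    sigma = noise_variance ** 0.5
    # 1) Additive zero-mean Gaussian sensor noise (clipped at zero).
    noisy = [max(0.0, v + rng.gauss(0.0, sigma)) for v in linear_pixels]
    # 2) Histogram-based exposure: scale so the chosen percentile maps to 1.0.
    ranked = sorted(noisy)
    white = ranked[min(len(ranked) - 1, int(percentile * len(ranked)))] or 1.0
    exposed = [min(1.0, v / white) for v in noisy]
    # 3) Non-linear gamma correction, then quantization to 8 bit.
    return [round(255 * v ** (1.0 / gamma)) for v in exposed]

# Noise-free run on a small hypothetical intensity ramp.
out = simulate_sensor([0.0, 0.5, 1.0, 2.0, 4.0], noise_variance=0.0)
```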


4.5 Computation and Evaluation of Perceptional Functions 


Perception functions comprise a multitude of different approaches for a wide range
of different tasks. For the experiments presented in this chapter, we consider
the tasks of semantic segmentation and 2D bounding box detection. In
the first task, the perception function segments an input image into different objects
by assigning a semantic label to each of the input image pixels. One of the main
advantages of semantic segmentation is the visual representation of the task, which
can be easily understood and analyzed for flaws by a human.

For semantic segmentation we consider two different topologies: DeeplabV3+
as proposed in [CPK+18] and Detectron2 [WKM+19], both of which use
ResNet101 [HZRS16] backbones.

These algorithms are trained on three different datasets to create three different
models for evaluation. The first dataset is the Cityscapes dataset [COR+16], a
collection of European urban street scenes during daytime with good to medium weather
conditions. The second dataset is A2D2 [GKM+20]. Similar to the Cityscapes dataset,
it is a collection of European urban street scenes, and additionally it has sequences
from driving on a motorway. The last dataset, KI-A tranche 3, is a synthetic dataset
provided by BIT-TS, a project partner of the KI-Absicherung project,⁷ consisting of
urban street scenes inspired by the preceding two real-world datasets. All of these
datasets are labeled on a subset of 11 classes that the datasets have in common, to
provide comparability between the results of the different trained and evaluated models.


370 O. Grau et al.

For the second task, the 2D-bounding box detection, we utilize the single-shot 
multibox detector (SSD) by [LAE+16], a 2D-bounding box detector trained on the 
synthetic data for pedestrian detection. This bounding box detector is applied on our 
variational data in Sect. 5.

To measure the performance of the task of semantic segmentation, the mean
intersection-over-union (mIoU) from the COCO semantic segmentation benchmark
task is used [LSD15]. The mIoU is defined as the intersection between the predicted
semantic label classes and their corresponding ground truth divided by the union
of the same, averaged over all classes. Another performance measure utilized is the
pixel accuracy (pAcc), which is defined as follows:


$$\mathrm{pAcc} = \frac{TP + TN}{TP + FP + FN + TN} \tag{1}$$


The numbers of true positives (TP), true negatives (TN), false positives (FP), and
false negatives (FN) are used to calculate pAcc, which can also be seen as the ratio
of correctly predicted pixels to all pixels considered for evaluation.
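
Both metrics can be sketched on flattened per-pixel label lists. The helper names and
the toy labels are illustrative assumptions; classes that occur neither in the
prediction nor in the ground truth are skipped when averaging the IoU.

```python
def confusion_counts(pred, truth, cls):
    """Count TP, FP, FN, TN for one class over flattened per-pixel label lists."""
    tp = sum(p == cls and t == cls for p, t in zip(pred, truth))
    fp = sum(p == cls and t != cls for p, t in zip(pred, truth))
    fn = sum(p != cls and t == cls for p, t in zip(pred, truth))
    tn = len(truth) - tp - fp - fn
    return tp, fp, fn, tn

def mean_iou(pred, truth, classes):
    """Average TP / (TP + FP + FN) over all classes present in pred or truth."""
    ious = []
    for cls in classes:
        tp, fp, fn, _ = confusion_counts(pred, truth, cls)
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)

def pixel_accuracy(pred, truth, cls):
    """pAcc = (TP + TN) / (TP + FP + FN + TN) for one class, as in Eq. (1)."""
    tp, fp, fn, tn = confusion_counts(pred, truth, cls)
    return (tp + tn) / (tp + fp + fn + tn)

truth = ["road", "road", "person", "person", "sidewalk", "sidewalk"]
pred = ["road", "road", "person", "sidewalk", "sidewalk", "sidewalk"]
miou = mean_iou(pred, truth, ["road", "person", "sidewalk"])   # (1 + 1/2 + 2/3) / 3
pacc = pixel_accuracy(pred, truth, "person")                   # 5 / 6
```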

For the 2D-bounding box detection we are interested in cases where, according to 
our definition, the performance-limiting factors are within bounds where the network 
should still be able to correctly predict a reasonable bounding box for each object 
to detect. For each synthesized and inferred image, the true positive rate (TPR) is 
calculated. The TPR is defined as the number of correctly detected objects (TP) over 
the sum of correctly detected and undetected objects (TP+FN). As we are interested
in prediction failure cases, we can then filter out all images with a true positive rate
(TPR) of 1 and are left with images in which the detector has missed at least one object.
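
The filtering step amounts to a few lines. The image identifiers and counts below are
made up for illustration, and images without any object to detect are skipped, which
is an assumption about how such frames are handled.

```python
def true_positive_rate(tp, fn):
    """TPR = TP / (TP + FN)."""
    return tp / (tp + fn)

def failure_cases(per_image_counts):
    """Keep only images in which at least one object was missed (TPR < 1)."""
    return [
        image_id
        for image_id, (tp, fn) in per_image_counts.items()
        if tp + fn > 0 and true_positive_rate(tp, fn) < 1.0
    ]

# Hypothetical (TP, FN) counts per image.
counts = {"img_000": (2, 0), "img_001": (1, 1), "img_002": (0, 1), "img_003": (0, 0)}
failed = failure_cases(counts)
```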


4.6 Controller 


The VALERIE controller (as depicted in Fig. 1, validation flow control) executes the
validation run. This run can be configured in multiple ways, depending on how much
synthetic data is to be generated and evaluated. Two aspects have a major influence on
this: first, the specification of the parameters to be varied, and second, the sampling
strategy used, which also depends on the validation goal. Both aspects are briefly
described in the following.


Specification of variable validation parameters: As outlined in Sect. 4.2, the
approach depends on the provision of a generative scene model. This consists of
a parameterized 3D scene model and includes 3D assets in the form of static and
dynamic objects. On top of this, we define variable parameters in this scene as an
explicit list, as explained in Sect. 4.2.


7 https://www.ki-absicherung-projekt.de/.

For the specification of a validation run, all or a subset of these parameters are
selected, and a range and sampling distribution are added for each selected parameter.
For example, to vary the x-position of a person in the scene along a line with a
uniform distribution and a step size of 1 m, we define


{p1, UNIFORM, 1.5, 5.5, 1.0}


The fields refer to the following: p1 refers to the parameter declaration of the
x-position of person-1 in the example of Sect. 4.2. The field UNIFORM selects a
uniform sampling distribution; other modes include GAUSSIAN (Gaussian distribution).
The values 1.5, 5.5, 1.0 specify the parameter range [1.5, 5.5]
and the initial step size of 1 m.
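
For the UNIFORM example, expanding the range with the given step size could look as
follows. The function name is hypothetical, and treating the upper bound as inclusive
is an assumption; the text does not state whether 5.5 itself is sampled.

```python
def expand_range(minimum, maximum, step):
    """Return the grid values minimum, minimum + step, ... up to and including maximum."""
    # The small epsilon guards against floating-point round-off in the division.
    count = int((maximum - minimum) / step + 1e-9) + 1
    return [minimum + i * step for i in range(count)]

# Expansion of {p1, UNIFORM, 1.5, 5.5, 1.0}
values = expand_range(1.5, 5.5, 1.0)
```

Computing each value as `minimum + i * step` avoids the drift that repeated addition
of the step would accumulate over long ranges.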


Sampling of variable validation parameters: The actual expansion or sampling 
of the validation parameter space can be further configured and influenced in the 
VALERIE controller by selecting a sampler and validation strategy or goal. 

The sampler object provides the controller with an interface to the validation
parameter space, considering the parameter ranges and optionally the expected parameter
distribution. We support uniform and Gaussian distributions.

In our current implementation, the controller can be configured to either sample 
the validation parameter space by a full grid search, or by a Monte-Carlo random 
sampling. 
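
The two sampling modes can be contrasted in a short sketch. The function names and the
second parameter p2 (a hypothetical orientation angle) are assumptions for
illustration; only the uniform case is shown.

```python
import itertools
import random

def grid_samples(axes):
    """Full grid search: every combination of the per-parameter value lists."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in itertools.product(*axes.values())]

def monte_carlo_samples(ranges, count, seed):
    """Random sampling: draw each parameter uniformly from its [minimum, maximum] range."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        for _ in range(count)
    ]

grid = grid_samples({"p1": [1.5, 2.5, 3.5, 4.5, 5.5], "p2": [0.0, 90.0, 180.0, 270.0]})
random_points = monte_carlo_samples({"p1": (1.5, 5.5), "p2": (0.0, 360.0)}, count=20, seed=7)
```

The grid grows multiplicatively with every added parameter (here 5 x 4 = 20 states),
whereas the number of Monte-Carlo samples is chosen independently of the dimension.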

The step size can be iteratively adapted depending on the validation goal.
One option here is to automatically refine the search for edge cases (or corner cases)
in the parameter space: As an edge case, we consider here a parameter instance
where the evaluation function changes from an ‘acceptable’ state to a ‘failed’
state (using a continuous performance metric). For our use case of person detection,
that means a drop in the performance metric below a threshold.
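
One simple way to refine the step size around such a transition is interval bisection
along a single parameter. The sketch below is an assumed strategy, not necessarily the
one implemented in the controller; it presumes exactly one acceptable-to-failed
transition in the interval, and `toy_metric` merely stands in for rendering the scene
and evaluating the DNN.

```python
def refine_edge_case(metric, lo, hi, threshold, tolerance):
    """Bisect [lo, hi] until the pass/fail transition is located within tolerance."""
    if not (metric(lo) >= threshold > metric(hi)):
        raise ValueError("interval must start acceptable and end failed")
    while hi - lo > tolerance:
        mid = 0.5 * (lo + hi)
        if metric(mid) >= threshold:
            lo = mid      # still acceptable: the transition lies further right
        else:
            hi = mid      # already failed: the transition lies further left
    return lo, hi

def toy_metric(occlusion):
    """Stand-in for 'render, run perception, compute the performance metric'."""
    return 1.0 - occlusion ** 2

lo, hi = refine_edge_case(toy_metric, 0.0, 1.0, threshold=0.5, tolerance=0.01)
```

Each iteration halves the interval, so the number of expensive render-and-evaluate
calls grows only logarithmically with the required resolution.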

Other validation goals we are planning to implement could be the automated 
determination of sensitive parameters or (ultimately) more intelligent search through 
high-dimensional validation parameter spaces. 


4.7 Computational Aspects and System Scalability 


Our approach is designed for execution in data centers. The implementation of the
components described above is modular and makes use of containerized modules
using Docker.⁸ For the actual execution of the modules we use the Slurm⁹ scheduling
tool, which allows running our validation engine with a high number of variants
in parallel, allowing the exploration of many states in the validation parameter space.

The results presented here were produced on an experimental setup using six dual
Xeon server nodes, each equipped with 380 GB RAM. The runtime of the data synthesis
process as outlined above is mainly determined by the rendering and is on the order
of 10 to 15 min per frame, using the high-quality physically based rendering (PBR)
Cycles render engine.


8 www.docker.com.


9 https://slurm.schedmd.com.


5 Evaluation Results and Discussion 


To evaluate the effectiveness of our data synthesis approach, we conducted experiments
in which we generated scenes, varied a few important parameters, and then
evaluated the perception performance, including an analysis of performance-limiting
factors, such as occlusions and distance to objects.

We used our scene generator to generate variations of street crossings, as depicted
in Fig. 2. For these examples a base ground is generated first, with flexible topology
(crossings, T-junction) and dimensions of streets, sidewalks, etc. In the next step,
buildings, persons, and objects, including cars, traffic signs, etc., are selected from
a database and randomly placed by the scene generator, taking into account the
probabilistic description and rules. The approach can handle any number of object
assets. The current experimental setup includes a total of about 500 assets, with about
60 different buildings, 180 different person models, and other objects, including
vegetation, vehicles, and so on.


Scene parameter variation: Within the generated scenes, we vary the position and 
orientations of persons and some occluding objects. 

Further, we change the illumination by changing the time of the day. This has two
main effects: First, it changes the illumination intensity and color (dominant at
sunset and sunrise), and second, it generates a variation of shadows cast into
the scene. In our experience, the latter in particular creates challenging situations
for the perception.


Comparison of object distribution: Fig. 5 shows the spatial distribution of persons
in (a) the Cityscapes dataset, (b) the KI-A tranche 3 dataset, and (c) a dataset using
our generative scene generator, as depicted in Figs. 2 and 3. The diagrams present a
top view of the respective sensor (viewing cone), and the color encodes the frequency of
persons within the sensor viewing cone, i.e., they give a representation of distance and
direction of persons in all considered frames of the dataset. The real-world Cityscapes
dataset has a distribution in which most persons are located left and right of
the center, i.e., the street. There are slightly more persons on the right side, which
can be explained by the fact that sidewalks on the left-hand side are often occluded by
vehicles from the other side of the road. As expected, the distribution of our dataset
resembles this distribution, with slightly less occupation in the distance. In contrast,
the distribution of the KI-A tranche 3 dataset shows a very sharp concentration in
what corresponds to a narrow band on the sidewalks of their 3D simulation.

Fig. 5 Pedestrian distribution over horizontal angle and distance. a: Cityscapes. b: KI-A tranche
3. c: Our synthetic data
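
A distribution of this kind can be accumulated as a two-dimensional histogram over
horizontal angle and distance. The following sketch is only an illustration: the
coordinate convention (x to the right, z along the optical axis), the bin sizes, and
the sample positions are assumptions, not the settings used for Fig. 5.

```python
import math
from collections import Counter

def polar_histogram(positions, angle_bin_deg=10.0, distance_bin_m=5.0):
    """Count persons per (horizontal angle, distance) cell in the sensor's top view.

    positions: (x, z) pairs in sensor coordinates, x to the right, z straight ahead.
    """
    histogram = Counter()
    for x, z in positions:
        angle = math.degrees(math.atan2(x, z))   # 0 degrees = straight ahead
        distance = math.hypot(x, z)
        cell = (math.floor(angle / angle_bin_deg), math.floor(distance / distance_bin_m))
        histogram[cell] += 1
    return histogram

# Two persons slightly right of the optical axis, one further to the left.
hist = polar_histogram([(0.5, 10.0), (1.0, 12.0), (-6.0, 9.0)])
```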


Influence of different occluding objects on detection performance: A number of
object and attribute variations are depicted in Fig. 6. On the left side, the SSD
bounding box detector [LAE+16] is applied to the three images with different occluding
objects in front of a pedestrian. In all three images, two bounding boxes are
predicted for the same pedestrian. While one bounding box includes the whole body,
the second bounding box only covers the non-occluded upper part of the pedestrian.
On the right side, the DeeplabV3+ model trained on the KI-A tranche 3 dataset is used
to create a semantic map of the same three images. Apart from the arms, the pedestrian
is detected, even partially through the occluding fence. However, another interesting
observation can be made: The ground the pedestrian stands on is always labeled
as sidewalk. We interpret this as an indication of a bias in the training data, as the
training data does not include enough images of pedestrians on the road, only on the
sidewalk. This hypothesis is further supported by the pedestrian distribution in
Fig. 5b, where the pedestrians are distributed narrowly left and right of the street
in the middle. Additionally, neither the bounding box prediction nor the semantic
segmentation includes the pedestrian’s arms. This can also be attributed to a bias in
the training data.

Fig. 6 Scene with variation of occluding objects. Left: 2D bounding box detection. Right: semantic
segmentation


Influence of noise on detection performance: An experiment demonstrating our
sensor simulation determines the influence of sensor noise on the predictive
performance. In Fig. 7, Gaussian noise with increasing variance is applied to an image,
and three DeeplabV3+ models trained on A2D2, Cityscapes, and a synthetic dataset,
respectively, are used to predict on the data. While image color pixels are
represented in the range $x_i \in [0, 255]$, the noise variance is in the range
$\sigma^2 \in [0, 20]$ with a step size of 1. For each noise variance step, the mIoU
performance metric on the image prediction is calculated per model. While the models
trained on Cityscapes and the synthetic dataset initially increase in performance, the
predictive performance of all models ultimately decreases with an increasing level of
sensor noise. The initial increase can be explained by the domain shift between
training and validation data, since a small noise variance can be observed in the
training data.

Fig. 7 Top: mIoU performance decreases with increasing noise variance. Bottom (left to right):
segmentation maps with increasing noise variance $\sigma^2 \in \{0, 10, 20\}$, image pixels $x_i \in [0, 255]$


Analysis of performance-limiting factors: Some scene parameters have a major
influence on the perception performance. These include the occlusion rate of objects,
where totally occluded objects are obviously not detectable, and the object size (in
pixels) in the image, which also has a natural boundary below which detection breaks
down because the object is too small. Other performance-limiting factors include
contrast and other physically observable parameters.

To measure the influence of performance-limiting factors on perception functions, or
the sensitivity of these functions to them, we designed an experiment using about
30,000 frames containing one person each. The person is moved and rotated on the
sidewalk and on the street. The occlusion is determined by rendering a mask of the
person alone and comparing it with the actual instance mask, which takes occluding
objects into account. An occlusion rate of 100% represents a fully occluded object.
Figure 8 shows results of this experiment, with each gray dot representing one frame
and the colored curves showing regression plots for differently clothed persons.
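
The mask comparison can be written down compactly. In this sketch the masks are flat
lists of 0/1 values, which is an assumption made to keep the example free of image
libraries; the function name is hypothetical.

```python
def occlusion_rate(full_mask, visible_mask):
    """Percentage of the unoccluded person mask that is hidden in the final image.

    full_mask: pixels of the person rendered alone.
    visible_mask: the person's pixels in the instance mask of the complete scene.
    """
    full = sum(full_mask)
    if full == 0:
        raise ValueError("person is not inside the image")
    visible = sum(1 for f, v in zip(full_mask, visible_mask) if f and v)
    return 100.0 * (1.0 - visible / full)

full_mask = [0, 1, 1, 1, 1, 0]
visible_mask = [0, 1, 1, 1, 0, 0]   # one of four person pixels is covered
rate = occlusion_rate(full_mask, visible_mask)
```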

The figure shows a downward trend in pAcc with increasing occlusion rate. The
Detectron2 model (trained on Cityscapes) is comparatively more robust than
DeeplabV3+. The plot shows that Detectron2 offers stable detection for occlusion
rates below 35%, after which the performance drops. The performance of DeeplabV3+
(also trained on Cityscapes) drops after an occlusion rate of 15%. The curves do not
follow a linear trend because other scene parameters (sunlight, shadow, direction of
pedestrian) are not constant across the rendered images.

[Figure: two panels, DeeplabV3+ (left) and Detectron2 (right); x-axis: occlusion rate
(0 to 70), y-axis: pAcc (0 to 100); regression curves for dark and bright clothes,
dark clothes, and bright clothes]

Fig. 8 Polynomial regression curves (of order 3) on pedestrian detection rate pAcc of
DeeplabV3+ (left) and Detectron2 (right) for various occlusion rates of a pedestrian wearing
dark or bright clothes

What can also be seen in the figure is that, despite the trend of the regression
curves, there is great variation in the data, visible in the widely scattered gray
points. This means that the performance also depends on factors other than the
occlusion rate. Figure 8 shows one example of the analysis possible with the
metadata provided by our framework. More parameters are considered in our previous
work [SGH20].


Data bias analysis: Another experiment we conducted considers failure cases, i.e.,
false negatives (FN) of the SSD 2D-bounding box detector in pedestrian
detection. To this end, we rendered 2640 images with our variational data
synthesis engine. These images are then processed by the SSD model and evaluated.
Only pedestrians with a bounding box width greater than 0.1 × image width and a
height greater than 0.1 × image height are considered valid for evaluation.
Additionally, only objects with an occlusion rate below 25% are considered valid.
These restrictions ensure that pedestrians in the validation are of sufficient size,
i.e., close to the camera, and clearly visible due to little occlusion, and should
therefore be easy to detect.
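
These validity restrictions translate into a simple predicate. The function name and
the example image size are hypothetical, and reading the height criterion as "greater
than 0.1 × image height" is an interpretation of the text.

```python
def is_valid_for_evaluation(box_w, box_h, image_w, image_h, occlusion_percent):
    """Apply the size and occlusion restrictions of the data bias experiment."""
    large_enough = box_w > 0.1 * image_w and box_h > 0.1 * image_h
    clearly_visible = occlusion_percent < 25.0
    return large_enough and clearly_visible

# Hypothetical 2048 x 1024 image with a large, mostly visible pedestrian.
valid = is_valid_for_evaluation(250, 400, 2048, 1024, occlusion_percent=10.0)
```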
Fig. 9 Count of detected and non-detected pedestrians for different pedestrian assets, i.e., different
clothing, ethnicity, and gender


With these restrictions in place we found that, of all the pedestrian assets the
synthesis engine placed in the scene, there were six assets that were missed by the
SSD model, as can be seen in Fig. 9.

The asset ID 1 is an Arabian woman wearing traditional clothes effectively veiling 
the person. Asset ID 2 is a Caucasian woman clothed in summer casual, i.e., short 
pants and short sleeves, revealing parts of her skin. The second Arabian ethnicity 
asset with the ID 3 is similar to asset 1 clothed in traditional veiling clothes but of 
male gender. The remaining assets 4, 5, and 6 are of male gender and Caucasian 
ethnicity wearing different work clothes, i.e., a blue paramedical outfit for ID 4, 
business casual jeans and jacket for ID 5, and white physician clothes for ID 6. 

Asset ID 2, the woman in summer casual clothing, is missed only a few 
times; in most cases the detection worked well, indicating no data bias for this asset. 
In contrast, pedestrian asset ID 6, a physician dressed in white hospital clothing, 
has not been detected at all. Additionally, two of the assets that were overlooked 
relatively most often by the network are the woman in Arabian clothing with asset ID 1 and 
the man in Arabian clothing with asset ID 3. This result suggests that these 
kinds of pedestrian assets, i.e., IDs 1, 3, and 6, were not present in the data used for training 
the model, and that adding them would mitigate this specific failure case. 


6 Outlook and Conclusions 


This chapter has introduced a new generative data synthesis framework for the 
validation of machine learning-based perception functions. The approach allows a very 
flexible description of scenes and of the parameters to be varied, as well as systematic tests of 
parameter variations in our unified validation parameter space. 

The conducted experiments demonstrate the benefits of splitting the validation 
process into scene variation that looks into randomized placement of objects and a 
variation of scene parameters and sensor simulation. Our simple probabilistic scene 
generator is scalable and able to produce scenes with a high number of different 
objects—as provided by an asset database. The spatial distribution of the positioned 
objects, as demonstrated for persons in Fig. 5, is more realistic compared to manually 
crafted 3D scenes. Along with our sensor simulation (results discussed in the chapter 
‘Optimized Data Synthesis for DNN Training and Validation by Sensor Artifact 
Simulation’ [HG22]), we present a step to close the domain-gap between synthetic 
and real data. Future work will continue to analyze the influence of other factors, 
such as rendering fidelity, scene complexity, and composition, to further improve the 
capabilities of the framework and make it even more applicable for the validation of 
real-world AI functions. 

Our experiments with performance-limiting factors, as shown for occlusion rates 
and object size (as a function of distance to the camera) in the previous section, give 
clear evidence that the performance of perception functions cannot be characterized 
by only a few factors. Instead, it is a complex function of many parameters and 
aspects, including scene complexity, scene lighting and weather conditions, and the 
sensor characteristics. The deep validation approach described in this chapter 
addresses this multi-dimensional complexity problem: we designed a system 
and methodology for flexible validation strategies that span all these parameters at 
once. 

Our validation parameterization, as demonstrated in the results section, is an 
effective way to detect performance problems in perception functions. Moreover, its 
flexible design supports sampling and practical computation at scale, which allows 
for deep exploration of the multi-variate validation parameter space. Therefore, we 
see our system as a valuable tool for the validation of perception functions. 

Moving forward, we are looking into using the deep synthesis approach to implement 
sophisticated algorithms that support more complex validation strategies. As 
another key direction, we target improvements in the computational efficiency of our 
validation approach, allowing coverage of more complexity and parameter dimensions. 


The Good and the Bad: Using Neuron Coverage as a DNN Validation Technique 


Sujan Sai Gannamaneni, Maram Akila, Christian Heinzemann, 
and Matthias Woehrle 


Abstract Verification and validation (V&V) is a crucial step for the certification 
and deployment of deep neural networks (DNNs). Neuron coverage, inspired by 
code coverage in software testing, has been proposed as one such V&V method. 
We provide a summary of different neuron coverage variants and their inspiration 
from traditional software engineering V&V methods. Our first experiment shows 
that novelty and granularity are important considerations when assessing a coverage 
metric. Building on these observations, we provide an illustrative example for study- 
ing the advantages of pairwise coverage over simple neuron coverage. Finally, we 
show that there is an upper bound of realizable neuron coverage when test data are 
sampled from inside the operational design domain (in-ODD) instead of the entire 
input space. 


1 Introduction 


In the past few years, there has been rapid growth in the usage of deep neural net- 
works (DNNs) for safety-critical applications. Computer vision models, in particular, 
are being used in, e.g., autonomous driving or medical diagnostics. This shift in use 
cases is met by increasing demand for verification and validation (V&V) methods 
for DNNs [BEW+18, ZHML20, RJS+20]. However, unlike in traditional software 


S. S. Gannamaneni (✉) · M. Akila 
Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, 
Schloss Birlinghoven 1, 53757 Sankt Augustin, Germany 
e-mail: sujan.sai.gannamaneni@iais.fraunhofer.de 

M. Akila 
e-mail: maram.akila@iais.fraunhofer.de 

C. Heinzemann · M. Woehrle 
Robert Bosch GmbH, Corporate Research, Robert-Bosch-Campus 1, 
71272 Renningen, Germany 
e-mail: christian.heinzemann@de.bosch.com 

M. Woehrle 
e-mail: matthias.woehrle@de.bosch.com 


© The Author(s) 2022 
T. Fingscheidt et al. (eds.), Deep Neural Networks and Data for Automated Driving, 
https://doi.org/10.1007/978-3-031-01233-4_14 




engineering, developing rigorous techniques for V&V of DNNs is challenging as 
their intrinsic working is not directly of human design. Instead, the relevant param- 
eters are learned, and their meaning often eludes the human engineer. 

With the strengthened focus on V&V for DNNs, there was a gradual shift away 
from mere performance metrics to developing techniques that measure and inves- 
tigate, e.g., (local) interpretability or robustness. While insightful, such approaches 
focus on specific instances of a given (test) dataset. However, they typically do not 
detail the required number of elements in those datasets in the sense of sufficient 
testing. 

Inspired by testing in software engineering, Pei et al. proposed a variant of code 
coverage called neuron coverage (NC) [PCYJ19]. In analogy to checking if a line of 
code is executed for any given test sample, NC provides us with the percentage of 
activated neurons for a given test dataset. 

For a test (or in this case: coverage) metric, the granularity or “level of difficulty” is 
crucial; if a test is too easy to fulfill, the chances are low that it will uncover errors. If, 
on the other hand, it is barely possible (or even impossible) to fully perform the test, 
it becomes ill-defined as no stopping criterion exists or can be found (cf. [SHK+19]). 
This makes the level of granularity of a test a decisive criterion. We can illustrate this 
on the example of classical code coverage for the pseudo-code shown in Algorithm 1. 


Algorithm 1 Code coverage example 
  Set a 
  c ← 1/2 
  if a > 2 then 
      c ← −4 
  end if 
  if a < 4 then 
      a ← c · a 
  end if 
  return a + c 


It takes a (real) input variable a and maps it onto a (real) output. Following code 
coverage requirements, we would need to find inputs a_i such that, in total, each 
line of code is executed at least once. The two values {a_1 = 1, a_2 = 5} would lead 
the algorithm to pass through the second or the first if-clause, respectively, 
and would thereby together fulfill this requirement. The advantage of this test is 
that, besides being simple, it has a well-defined end when the full code is covered. 
Conversely, should not all of the code be coverable, this directly reveals unreachable 
statements that can be removed from the code without changing its functionality. 
Removing them would be seen as a step towards improving the code's “readability” and as 
avoiding potentially unintended behavior, should those parts of the code become reachable by 
some unforeseen means. However, this form of coverage can provide 
only a rudimentary test of the algorithm. For instance, the output in the example 
differs significantly if the algorithm runs through both clauses in the same run, which 
would be the case, e.g., for a_3 = 3. We cannot (or at least not with a reasonable amount 
of effort) test a given code for all potential eventualities. However, we can evaluate 
it with more complex test criteria, e.g., pair-wise tests in place of the single-line tests. 
A reasonable approach could be to test whether both if-clauses “A” and “B” were used in 
the same run or whether only one of them was passed. While this approach would 
surely have uncovered the potentially erroneous behavior from the third input a_3, 
it also scales quadratically with the number of clauses, while the previous criterion 
scaled linearly. Additionally, it is a priori unclear to which extent an algorithm can 
be covered in principle. For instance, in the example, no value for a can be found 
such that both if-clauses would be skipped simultaneously. 
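
As a minimal sketch, Algorithm 1 can be transcribed to Python together with a record of which if-clauses were entered in a run. The constants 1/2 and −4 are assumed from the algorithm listing, and the function name and the clause labels "A" and "B" are illustrative choices.

```python
def algorithm_1(a):
    """Transcription of Algorithm 1; also reports which if-clauses ran."""
    taken = set()
    c = 0.5              # assumed initial value of the latent variable c
    if a > 2:            # clause "A"
        c = -4.0
        taken.add("A")
    if a < 4:            # clause "B"
        a = c * a
        taken.add("B")
    return a + c, frozenset(taken)


# Single-clause coverage: the inputs 1 and 5 together enter both clauses.
single = set()
for a in (1, 5):
    _, taken = algorithm_1(a)
    single |= taken

# Pair-wise view: record the combination of clauses entered in each run.
combos = {a: algorithm_1(a) for a in (1, 5, 3)}
```

Under these assumed constants, the inputs 1 and 5 both return 1.0, whereas the input 3, which enters both clauses, returns −16.0. The combination "neither clause" never appears, since it would require a ≤ 2 and a ≥ 4 at the same time.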

The above example on algorithmic code coverage can be transferred to the 
coverage of neurons within a DNN. Assuming, as we will also do in the following, an 
architecture based on ReLU non-linearities, where ReLU(y) = max(y, 0), we may 
count a unit (neuron) as covered if, for a given input, its output after application of 
the ReLU is non-zero.¹ Full coverage, for this measure, would thus be achieved if 
a test set X can be found such that for any unit of the DNN, at least one data 
sample x_i ∈ X exists where that unit is covered. As with the code coverage before, 
the inability to construct or find such a sample x_i could indicate that the unit in 
question is superfluous and might be removed (pruned) without changing the 
functionality of the DNN. The generalization to pair-wise metrics is straightforward, but 
owing to the often sequential structure of feed-forward DNNs, a further restriction 
is applied, i.e., only pairs of units from adjacent layers are considered when 
evaluating with this metric. There are, however, two important distinctions to code 
coverage. First, the interaction among layers, although reminiscent of the single 
control flow in the above code example, is significantly more complicated. Examining 
Algorithm 1 more closely, the interaction was mediated by a latent variable c. To 
some extent, each unit within a layer can be seen as such a variable, allowing for 
various behaviors in the subsequent layer. Moreover, not activating a unit is not an 
omission but a significant statement in its own right; compare, for instance, [EP20]. 
Second, input parameters and variations in classical software are often 
reasonably well understood. The use of DNNs, in contrast, is often motivated by 
the inability to specify such variations. Therefore, a secondary use of coverage 
metrics becomes apparent: determining the novelty of a data point x_i w.r.t. a test set X' (x_i ∉ X') 
by measuring how strongly the coverage values of X' and X' ∪ {x_i} 
differ. Such a concept is based on the assumption that coverage, to some degree, 
represents the internal state (or “control flow” if seen from a software point of view) 
of the DNN and could thus be used to determine the completeness of the test set in 
the sense that no further novel states of the DNN could be reached. 
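
The coverage and novelty notions described above can be sketched for a small fully connected ReLU network. This is an illustration in plain Python, not the implementation used in the chapter; the toy weights, the function names, and the definition of novelty as a simple coverage difference are assumptions made for the example.

```python
def relu_layer(x, weights, bias):
    """One fully connected layer followed by ReLU(y) = max(y, 0)."""
    return [max(sum(w * xi for w, xi in zip(row, x)) + b, 0.0)
            for row, b in zip(weights, bias)]


def covered_units(samples, layers):
    """Set of (layer index, unit index) with non-zero output
    for at least one sample."""
    covered = set()
    for x in samples:
        h = x
        for l, (weights, bias) in enumerate(layers):
            h = relu_layer(h, weights, bias)
            covered |= {(l, i) for i, out in enumerate(h) if out > 0.0}
    return covered


def neuron_coverage(samples, layers):
    """Fraction of units covered by the given samples."""
    total = sum(len(bias) for _, bias in layers)
    return len(covered_units(samples, layers)) / total


def novelty(x_new, test_set, layers):
    """Coverage gained by adding x_new to test_set."""
    return (neuron_coverage(test_set + [x_new], layers)
            - neuron_coverage(test_set, layers))


# Toy network (hypothetical weights): 2 inputs -> 2 units -> 1 unit.
layers = [
    ([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0]),  # fire for x0 > 0 and x0 < 0
    ([[1.0, 1.0]], [-10.0]),                  # fires only if hidden sum > 10
]
test_set = [[1.0, 0.0]]
```

In this toy setup, a sample with negative first coordinate raises the coverage because it activates a previously silent unit, while another sample with positive first coordinate of similar size adds nothing. The last unit is only reached by an input with a much larger magnitude than the others.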

In this chapter, we address the following aspects: On the level of single-unit 
coverage, we investigate the effect of granularity, i.e., to which degree layer-wise resolved 
metrics allow for a more detailed test coverage. In this context, we also investigate 
whether the coverage metric is informative regarding the novelty of samples by 
comparing trained and untrained DNNs on CIFAR-10. For both single and pairwise 
coverage, we address under which circumstances (full) coverage can be reached. In a 
(synthetic) toy example, we also differentiate how strongly these statements depend 
on the restrictions posed on the test set data when staying within a DNN's intended 
input domain (i.e., the operational design domain, ODD). 


¹ While generalizations to non-zero thresholds exist, see for instance [MJXZ+18], we restrict 
ourselves to this case as it is a natural choice for ReLU non-linearities. 


2 Related Works 


Pei et al. [PCYJ19] introduced neuron coverage, i.e., looking at the percentage of 
neurons activated in a network, as a testing method. Using a dual optimization goal, 
they generate realistic-looking test data that increases the percentage of activated 
neurons. The method proposed in [MJXZ+18], a variant of [PCYJ19], uses the same 
definition of NC and additionally introduces k-multisection coverage and neuron 
boundary coverage. Another variant, [TPJR18], uses image transformations on syn- 
thetic images to maximize neuron coverage. 

Sun et al. [SHK+19], inspired by modified condition/decision coverage 
(MCDC) in the software field, proposed four condition/decision-based coverage 
methods. Furthermore, they showed that the proposed methods subsume earlier 
works, i.e., satisfying their proposed methods also satisfies the weaker earlier 
coverage methods. 

However, there are also several works expressing criticism of NC as a validation 
technique [DZW+19, HCWG+20, AAG+20]. While in [DZW+19], it is shown that 
there is limited to no correlation between robustness and coverage, Harel-Canada 
et al. [HCWG+20] show that test sets generated by increasing NC, as defined in 
[PCYJ19, TPJR18], fail in specific criteria such as defect detection, naturalness, and 
output impartiality. Abrecht et al. [AAG+20] showed that there was a discrepancy 
in how neurons in different layers were evaluated in earlier works due to 
differing definitions of what constitutes a neuron in a DNN. Furthermore, they show 
that evaluating NC at a more granular level provides more insight than just looking 
at coverage over the entire network. However, they also show that even this granular 
coverage is still easy to achieve with simple augmentation techniques performing 
better than test data generation methods like DeepXplore [PCYJ19]. 

Abrecht et al. [AGG+21] presented an overview of testing methods of computer 
vision DNNs in the context of automated driving. This includes a discussion of 
different types of adequacy criteria for testing and concretely of structural coverage 
for neural networks as one form of an adequacy criterion that supports testing. 

In this chapter, we follow the definitions of layer-wise coverage (LWC) [AAG+20] 
to determine NC and adjust it for pairwise coverage (PWC). While the fundamental 
definition, i.e., a neuron or a unit is considered as covered as long as there is at least 
one sample x; in the test set X such that the unit produces a non-zero output, remains 
unchanged across definitions, the precise meaning of what constitutes a unit differs. 
Compared to other metrics, e.g., DeepXplore [PCYJ19], LWC provides a more 
fine-grained resolution of coverage, especially concerning convolutional layers. While 
one could treat each channel of the convolutional layer as a single unit with respect 
to coverage, LWC demands that each instance of the use of the filter is counted as an 
individual unit to be covered. As an example, if a two-channel convolutional 
layer results in an output of 3 × 3 × 2 real numbers, where the first two numbers 
are due to the size of the input image, LWC would require covering each of the 18 
resulting “neurons”, while DeepXplore would consider only two “neurons”, one for 
each channel. 
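
The difference between the two unit definitions amounts to a simple count. The helper below is a hypothetical illustration of the 3 × 3 × 2 example from the text and is not part of either method's code.

```python
def unit_counts(height, width, channels):
    """Units to cover for one convolutional output of the given shape."""
    lwc_units = height * width * channels  # every filter application counts
    channel_units = channels               # one unit per channel
    return lwc_units, channel_units


lwc, per_channel = unit_counts(3, 3, 2)
```

For larger feature maps the gap widens quickly, since the layer-wise count grows with the spatial size of the output while the per-channel count does not.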


3 Granularity and Novelty 


Several works [SHK+19, AAG+20] have shown that achieving high NC is trivial 
with only a few randomly selected in-domain non-adversarial images. Abrecht et 
al. [AAG+20] have further shown that trained and untrained networks show similar 
NC behavior for a given test dataset. Intuitively, one would expect that, first, a metric 
used to investigate the safety of a DNN in terms of finding new input samples should 
not be easily satisfiable and, second, that the metric should depend on the “quality” 
of the input. Typically, a DNN is trained to solve a specific problem that only covers a 
very small subset of all possible inputs, e.g., classifying an image as either showing a 
cat or dog could be contrasted by completely unfitting inputs such as the figures in this 
chapter. Following the analogy of code coverage, this would be the difference between 
expected and fully random (and mostly unfitting) inputs, where for the latter, one 
would expect nonsensical outputs or error messages. Especially for neural networks, 
this second point is of relevance as they are typically used in cases where a distinction 
between reasonable and unreasonable input can be made only by human experts or 
DNNs but not via “conventional” algorithms. This makes an extension of the input 
coverage to novel examples challenging, as “errors” discovered on erroneous inputs 
might lead to questionable insights into the DNN's trustworthiness or reliability. 

Within this section, we elaborate on the two raised concerns by investigating an 
example DNN on the level of (single-point) NC. For this, we will look not only 
at the reached coverage but also take a closer look at its saturation behavior. Further, 
we compare both trained and untrained (i.e., randomly initialized) versions of the 
DNN using either actual test data (from the dataset) or randomly generated Gaussian 
input. To be more precise, we use VGG16 [SZ15] networks with ReLU activations 
and train them from scratch on CIFAR-10 [Kri09], a dataset containing RGB images of (pixel-, 
channel-) size 32 × 32 × 3 with 10 classes (containing objects, such as automobiles, 
but also creatures, such as frogs or cats). After training,² our network 
reaches an average accuracy of 82% on the test set. As stated above, we employ 
the layer-wise coverage (LWC) notion defined by Abrecht et al. [AAG+20] to 
measure NC as it subsumes less granular metrics, such as the often-used definitions of 
DeepXplore [PCYJ19]. Subsumption implies that if DeepXplore does not reach full 
coverage, neither will LWC, but not necessarily the other way around. 


2 We employ an SGD optimizer with a learning rate of 107° for 20 epochs and a learning rate decay 
of 0.1 at epochs 10 and 15. 

Fig. 1 Here, we show the neuron coverage in a VGG16 network. The blue solid and dashed lines 
show mean coverage for trained and untrained models on CIFAR-10 test images, respectively. The 
orange solid and dashed lines show mean coverage for trained and untrained models on randomly 
generated data samples, respectively. The standard deviation is shown for each experiment as the 
shaded area 


In Fig. 1, we plot the NC of the used VGG16 DNNs with respect to 1000 test 
samples for all four test cases. For each case, we employed five different DNNs, 
independently initialized and/or trained, and averaged the results for further stability. 
For evaluation, we use both the real CIFAR-10 test set data and randomly generated 
data samples from a Gaussian distribution loosely fitting the overall data range. 
While the coverage behaves qualitatively similarly in all cases, a clear separation 
between (assumed) limiting values and saturation speeds can be seen. This difference 
is most pronounced between the trained DNN that is tested with CIFAR-10 test data, 
representing the most structured case as both the data and the weights carry (semantic) 
meaning, and all other cases, which contain at least one random component. 

Fig. 2 Log plot of mean coverage in first (left) and penultimate (right) layer for trained and untrained 
networks. In the first layer, neuron coverage shows similar behavior for trained and untrained 
networks. However, in the penultimate layer, there is a significant difference in behavior between 
trained and untrained networks. The standard deviation is shown as shaded region. As we consider 
log(1 − C(t)/N), with the number of covered neurons C(t), total neurons N, and test samples t, 
the value approaches “negative infinity” when full coverage is achieved. This behavior can be seen 
in the figure on the left. The x-axis on the right is on log scale 


A rough understanding of the behavior of NC can be acquired when considering 
the following highly simplified thought experiment. Assume that the activation of a 
neuron is an entirely probabilistic event in which a given input sample activates the 
neuron with probability 0 < p < 1. Then the chance of said neuron remaining inactive 
for t samples would be q^t with q := 1 − p. While this suggests an exponential 
saturation rate for the coverage of a single neuron, the idea only carries over to the full 
set of all considered neurons if one assumes further that their activation probabilities 
p_i were identical and their activations independent. Even omitting only the first 
condition (identicality) can lead to qualitatively different behavior, in which specific 
properties would depend on the distribution of the p_i among the neurons. A typical 
expectation for a broad distribution would be an approximately algebraic saturation.³ 
Unsurprisingly, we can recover both types of behavior within the experiments. 
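
The thought experiment above can be checked numerically. The sketch below computes the expected fraction of still-inactive neurons after t samples, once for identical activation probabilities and once for probabilities spread evenly over (0, 1). The even spread is an assumption chosen only because it has a simple closed form; it is not claimed to describe the VGG16 networks.

```python
def uncovered_fraction(probabilities, t):
    """Expected fraction of neurons still inactive after t independent
    samples, if neuron i fires with probability probabilities[i]."""
    return sum((1.0 - p) ** t for p in probabilities) / len(probabilities)


n = 100_000
identical = [0.5] * n                      # every neuron: p = 1/2
broad = [(k + 0.5) / n for k in range(n)]  # p spread evenly over (0, 1)

for t in (1, 10, 100):
    print(t, uncovered_fraction(identical, t), uncovered_fraction(broad, t))
```

With identical probabilities, the uncovered fraction is exactly q^t and thus decays exponentially. For the even spread, the average approximates the integral of (1 − p)^t over p from 0 to 1, which equals 1/(t + 1), i.e., an algebraic decay dominated by the neurons with small p.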

A connection to the above thought experiment is obtained when investigating NC 
layer-wise. While from the perspective of testing, one might argue that the coverage 
of any given layer is subsumed in the coverage of the full model, the differing behavior 
of the layers warrants investigating NC on this level of granularity. For the first and 
the penultimate layer of VGG16, the results are depicted in Fig. 2. In the first layer, 
one would expect that, at least for random input, each neuron is equally well suited 
to be activated, and thus saturation would be reached exponentially. As can be seen, 
this holds almost ideally for the untrained DNN if tested on random input, and to a 
good approximation still when using the structured CIFAR-10 input or when testing 
the trained DNN with the random samples. In the latter case, the decision boundaries 
of each trained neuron are “random” with respect to the incoming data, which does 
not represent any meaningful semantic concepts or correlations. Note that we do not 
show the coverage in the figure but rather the number of not-yet-activated neurons in 
a logarithmic fashion; straight lines represent exponential saturation. Regarding the 
speed of saturation, i.e., the exponent or probability p, the naive assumption is that a 
neuron separates the input via a half-plane and each case, activation or inactivation, 
is equally likely. This corresponds to a coin flip with probability p = q = 1/2 and 
is depicted in the figure for comparison. Only the trained DNN, when tested on 
actual test data, deviates from this behavior and saturates more slowly. This might be 
explained by the fact that the network can use feature-based correlation in the data, 
and a new test sample is therefore of more limited novelty than a fully random, and 
thus unexpected, one. 


³ Consider, for comparison, how a superposition or mixture of Gaussians turns into a heavy-tailed 
distribution, e.g., Student's t, if the standard deviations of each Gaussian differ strongly enough, 
cf. [Bar12]. 

Turning to the penultimate layer, shown on the right-hand side in Fig. 2, we choose 
a double-logarithmic visualization such that algebraic saturation is now given by 
straight lines. To some extent, all four test cases seem to satisfy this type of satu- 
ration. However, the exponent, i.e., the speed of saturation, differs significantly and 
is slowest for both cases of the untrained DNN. A possible explanation might sim- 
ply be the propagation of information that is not optimized in a random DNN. By 
design, a (classification) DNN maps a large input (here 3072-dimensional) onto a 
low-dimensional output (10 classes) and, thus, has to discard large amounts of infor- 
mation along the layers. This heuristic assumption is supported by the behavior of 
the trained DNN, which achieves high coverage significantly faster, even more so if 
the input contains features it was optimized to recognize (i.e., actual test data). 

As shown, NC does depend on the training and the use of an appropriate test 
set. Even then, the final coverage might be reached only slowly. However, returning 
to Fig. 1, relatively high coverage is reached early on with only a few 
samples. It is, therefore, more a question of resolving a handful of not-covered 
neurons that might not be receptive to the data used. In the next section, we will look 
more closely into whether or not all neurons can even be covered if the used data 
are restricted to the operational domain of the network. This property, capturing a large 
share of the coverage with only limited effort, might not be desirable from the 
perspective of testing, especially if the novelty of samples is supposed to be judged. 
For such use cases, more granular tests might be more appropriate. A straightforward 
extension might be to include not only “positive” neuron coverage (NC+) as done 
before but also “negative” coverage (NC−), where a neuron counts as covered if it was 
inactive for at least one sample. A neuron might be almost always active, and therefore 
the rare cases where it does not fire might be of interest. For this reason, both types 
of coverage are to some degree independent, i.e., while it can be possible for a DNN 
to reach full NC+ coverage, this does not necessarily ensure full NC− coverage, 
and vice versa. Especially for networks with ReLU activations, NC− coverage is 
of interest as the non-linearity only takes effect when a neuron is not active. For a 
discussion, see, for instance, [EP20]. 
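
The distinction between the positive and the negative variant can be made concrete on a small table of recorded activations. The function name and the toy values below are hypothetical; the sketch only assumes that post-ReLU outputs are available per sample and unit.

```python
def nc_plus_minus(activations):
    """activations[s][i] is the post-ReLU output of unit i for sample s.
    Returns (NC+, NC-) as fractions of units."""
    n_units = len(activations[0])
    # NC+: the unit produced a non-zero output for at least one sample.
    active_once = [any(row[i] > 0.0 for row in activations)
                   for i in range(n_units)]
    # NC-: the unit produced a zero output for at least one sample.
    inactive_once = [any(row[i] <= 0.0 for row in activations)
                     for i in range(n_units)]
    return sum(active_once) / n_units, sum(inactive_once) / n_units


# Three samples, three units (hypothetical values):
# unit 0 always fires, unit 1 never fires, unit 2 does both.
acts = [[0.7, 0.0, 0.2],
        [1.3, 0.0, 0.0],
        [0.1, 0.0, 0.9]]
nc_pos, nc_neg = nc_plus_minus(acts)
```

In this example both values are 2/3, but for different reasons: the always-active unit is missing from the negative coverage and the never-active unit is missing from the positive coverage, which illustrates why neither metric implies the other.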

As discussed in the introduction, a further logical extension could be a switch from 
measuring single neuron activations or inactivations to a notion of pairwise coverage 
(PWC). Here, we follow a simplified version of the sign-sign coverage introduced 
in [SHK+19]. Due to the typically feed-forward type of network structure, we limit 
ourselves to adjacent layers with a pair of neurons as the unit of interest. A pair 
of neurons, n_{ℓ,i} and n_{ℓ+1,j} from layers ℓ and ℓ + 1, counts as covered with respect to 
PWC++ if, for a given input, both neurons are active, i.e., assume values greater 
than 0. Similarly, we define PWC−− for cases where the pre-activations are both 
non-positive, and the neuron outputs are thus zero. Likewise, the two alternating activation 
patterns, PWC+− and PWC−+, can be defined. It is easy to see that if a DNN is fully 
PWC++ (or PWC−−) covered, it is also fully NC+ (NC−, respectively) covered. 
Therefore, the PWC metrics subsume the single neuron NC metrics. 


Table 1 NC and PWC: This table shows the large difference between the number of countable units 
of neurons in NC and of pairwise evaluations in PWC. Neurons are defined as per LWC [AAG+20]. 
For simplicity, we do not consider the effect of skip connections 

Network                  NC [10^5]   PWC [10^5] 
LeNet-5 [LBBH98]         0.081       81 
VGG-16 [SZ15]            3.159       82,030 
VGG-19 [SZ15]            3.425       85,428 
ResNet-18 [HZRS16]       0.548       1,673 
ResNet-34 [HZRS16]       0.804       2,265 
ResNet-50 [HZRS16]       2.309       12,117 
ResNet-101 [HZRS16]      3.354       13,721 


However, this increased granularity comes at a price. While the number of tests or conditions 
required to be fulfilled to reach full NC scales linearly with the number of 
neurons or units, pairwise metrics follow a quadratic growth. Due to the restriction 
to neighboring layers, the growth is effectively reduced by the number of layers, 
and the number of conditions can be approximated by (#neurons)^2 / #layers. Exact 
numbers for a selected few architectures are reported in Table 1, including the VGG16 
example used so far. Note that for simplicity, skip connection combinations, although 
forming direct links between remote layers, are not considered. Including them in 
the pairwise metric would lead to larger numbers. Even without them, the sheer magnitude 
shows that testing for full coverage is unlikely to be feasible in practice. Additionally, it is unclear 
whether full coverage can be reached in principle if the neuron output is correlated 
across layers. As this might pose a problem for tests that require an end criterion 
signifying when the test is finished, we include those more elaborate metrics in the 
next section where we investigate whether full coverage can be reached. 
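
The linear versus quadratic scaling can be reproduced with a few lines of arithmetic. The sketch below counts single units and adjacent-layer pairs for a list of layer sizes. The LeNet-5 layer sizes are an assumption based on the commonly described architecture (with pooling outputs counted as layers), not values given in the chapter.

```python
def count_units(layer_sizes):
    """Number of NC units and of adjacent-layer pairs (one PWC pattern)."""
    nc = sum(layer_sizes)
    pwc = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return nc, pwc


def pwc_estimate(layer_sizes):
    """Rule of thumb from the text: (#neurons)^2 / #layers."""
    return sum(layer_sizes) ** 2 / len(layer_sizes)


# Assumed LeNet-5 layer outputs (conv, pool, conv, pool, fc, fc, fc).
lenet5 = [6 * 28 * 28, 6 * 14 * 14, 16 * 10 * 10, 16 * 5 * 5, 120, 84, 10]
nc, pwc = count_units(lenet5)
```

Under these assumed layer sizes, the counts come to 8,094 units and 8,112,424 pairs, which is in line with the LeNet-5 row of Table 1 when both columns are read in units of 10^5. The rule of thumb gives roughly 9.4 million, i.e., the same order of magnitude.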


4 Formal Analysis of Achievable Coverage 


In this section, we analyze coverage metrics based on a very simple task and a 
simple domain, as described in Sect. 4.1. The key design objectives for the creation of 
the task are that (i) arbitrarily many data samples can be generated, (ii) a ground 
truth oracle exists in closed form, and (iii) the trained network and the input domain 
are sufficiently small such that specific activation patterns can be computed 
analytically. We describe the experimental setup in detail in Sect. 4.2, before providing the 
experimental results in Sect. 4.3, followed by a discussion in Sect. 4.4. 


4.1 Description of the Task 


Our simple task is counting the number of input “pixels” with positive values. We 
restrict ourselves to a system with four input pixels, whose values are constrained to 
the interval I = [−1, 1]. We call this the operational design domain (ODD) of our 
network, following [KF19]. This is also our training distribution. We can obviously 
also perform inference on samples outside this ODD, but these will be outside the 
training distribution since we only train on ODD samples. We frame this task as a 
classification problem with five output classes: the integers from 0 to 4. Thus, our 
input domain is I^4, our output domain is O = {0, 1, 2, 3, 4}, and for an input 
x = (x_1, ..., x_4) ∈ I^4 we can define a ground truth oracle as the cardinality 
|{i | x_i > 0}|, i.e., we count the pixels with 
a value above 0. In addition, we explicitly define all inputs in R^4 \ I^4 as 
out-of-distribution samples, with I^4 ⊂ R^4. 
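
The ground truth oracle and the ODD membership test are simple enough to state directly in code. The function names below are illustrative; the definitions follow the text (four pixels, values in [−1, 1], label equal to the number of strictly positive pixels).

```python
def oracle(x):
    """Ground truth label: number of pixels with a value above 0."""
    return sum(1 for value in x if value > 0)


def in_odd(x, low=-1.0, high=1.0):
    """True if x has four pixels that all lie in [low, high]."""
    return len(x) == 4 and all(low <= value <= high for value in x)
```

Note that the oracle is also defined outside the ODD, which is what makes it possible to label novel out-of-ODD examples in this toy setting.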

The rationale of our approach is as follows: What can be achieved with perfect 
knowledge of the system in the sense of (i) symbolically computing inputs to achieve 
activation patterns, (ii) having a clear definition of in-ODD and out-of-ODD, and 
(iii) having a perfect labeling function for novel examples? 

For this task, we use a multilayer perceptron (MLP) with three hidden fully connected
layers FC(n) with n output nodes. The architecture of the MLP is shown in Fig. 3. We
train our network using PyTorch.

[Figure: $x \in \mathcal{I}^4$ → FC(12) + ReLU → FC(10) + ReLU → FC(6) + ReLU → Softmax → Argmax]
Fig. 3 Architecture of the MLP

Fig. 4 Distribution of labels in training and test set, relative distribution is identical for both

Table 2 Accuracy [%] overall and per class for MLP_adv and MLP_plain. Best numbers are in boldface.
Performance of MLP_plain is slightly better, but differences are insignificant

Network | Overall | 0 | 1 | 2 | 3 | 4

We generate one million inputs uniformly sampled from $\mathcal{I}^4$. We split these
samples into training and test data sets using an 80/20 split, i.e., we end up with
800,000 training samples and 200,000 test samples. The distribution of the respective
ground truth labels is shown in Fig. 4. The distribution of the samples is as expected:
there are 16 unique combinations of positive and negative pixels in $\mathcal{I}^4$, of
which 6 yield two positive pixels, 4 each yield one or three positive pixels, and 1 each
yields 0 or 4 positive pixels.
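
The expected label distribution can be checked by enumerating the 16 sign patterns. This is a small sketch that assumes uniform sampling makes all sign patterns equally likely (a pixel value of exactly 0 has probability zero and is ignored).

```python
from collections import Counter
from itertools import product

# Enumerate the 2^4 sign patterns of four pixels (True = positive pixel).
patterns = list(product([False, True], repeat=4))

# Count how many patterns yield k positive pixels, for k = 0..4.
counts = Counter(sum(pattern) for pattern in patterns)

# Each pattern is equally likely under uniform sampling from [-1, 1]^4,
# so the expected relative frequency of label k is counts[k] / 16.
expected = {k: counts[k] / len(patterns) for k in range(5)}
print(counts)
print(expected)
```

The counts follow the binomial coefficients 1, 4, 6, 4, 1, so the classes 0 and 4 are each expected in only 1/16 of the samples.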

In the following, we study the impact of coverage models and the ODD on specific
networks to show specific effects rather than a statistical evaluation. Nevertheless,
we show that this is not a singular effect by training the MLP, whose architecture is
depicted in Fig. 3, with two training setups, leading to two different MLPs, MLP_adv
and MLP_plain. We train both MLPs for 100 epochs using Adam as an optimizer
with default PyTorch settings, a batch size of 100, standard cross-entropy loss,
and a single initialization. To increase robustness, MLP_adv leverages adversarial
training, here with the fast gradient sign method (FGSM) [GSS15], using
$\epsilon = 10^{-4}$, which corresponds to the attack strength. For MLP_plain, we use the
same training setup but do not apply adversarial training. Table 2 summarizes the final
performance on the test set in terms of overall accuracy and per-class accuracy for both
networks. We can see that MLP_plain has a slightly higher accuracy overall, as well as
for four out of the five output classes, than MLP_adv. Note that the lower accuracy of
MLP_adv is in line with standard robust training and shows that when trying to train a
robust network, more capacity may be necessary [GSS15]. We focus in the following on
the adversarially trained model for the sake of brevity since both setups resulted in
similar results.
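
To illustrate the FGSM step used in adversarial training, the following sketch applies it to a hypothetical logistic model rather than to the MLP, because the input gradient of that model can be written down by hand. The model, its weights, and the function names are assumptions made for this example only; the update rule $x_{adv} = x + \epsilon \cdot \mathrm{sign}(\nabla_x J)$ is the FGSM rule from [GSS15].

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(x: List[float], y: int, w: List[float], b: float) -> float:
    """Binary cross-entropy of the hypothetical model p = sigmoid(w.x + b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def fgsm(x: List[float], y: int, w: List[float], b: float, eps: float) -> List[float]:
    """One FGSM step: x_adv = x + eps * sign(dL/dx).

    For this logistic model the input gradient is dL/dx = (p - y) * w.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * math.copysign(1.0, g) if g != 0 else xi
            for xi, g in zip(x, grad)]


# Illustrative weights and input (not taken from the chapter).
w, b = [0.5, -1.0, 0.25, 2.0], 0.1
x, y = [0.2, -0.4, 0.6, 0.1], 1
x_adv = fgsm(x, y, w, b, eps=1e-4)
print(bce_loss(x, y, w, b), bce_loss(x_adv, y, w, b))
```

In adversarial training, such perturbed inputs are fed to the network during the training step, so the model learns to classify points in a small neighborhood of each training sample correctly.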


4.2 Experimental Setup 


In our experiments, we want to evaluate whether full coverage according to neuron 
coverage (NC) and pairwise coverage (PWC) can be achieved. The networks for the 
experiments are the trained MLPs described in the previous section. The goal of the 
experiment is to calculate a minimal number of inputs that maximizes coverage on 
the MLP and, if required, a number of non-coverable items. 

Since the MLP is simple enough, we can solve the coverage generation symbolically.
For this, we use a satisfiability modulo theories (SMT) solver, namely,
Z3 [dMB08], and encode the problem of finding inputs maximizing coverage as
SMT queries based on the theory of unquantified linear real arithmetic, i.e., Boolean
combinations of inequalities between linear polynomials over real variables. Note
that there are also further approaches based on SMT targeting larger neural networks
like Reluplex [KBD+17], which are, however, not needed for our example. We
iteratively compute solutions for individual neurons (or combinations of neurons) as
shown in Algorithm 2: We determine all coverage items for a neural network as a set
B. While there are still coverage items, we select one of these and try to cover it as
described in the following. We formulate the coverage of item i ∈ B as an SMT query
and call the SMT solver on this query using runSMT(). If it succeeds, as indicated
by the return value success being true, we run the network NN on the resulting input
x by executing callMLP(). This returns all resulting coverage items J, which are
added to the set of already covered coverage items C. Minimally, we find the coverage
item i from the SMT query, but there may be more. If the SMT solver cannot cover
item i, we add it to the set of uncoverable items U. Note that this implicitly also
depends on a specific coverage metric, which we omit in the algorithm for simplicity
of the presentation.

The function runSMT() generates individual samples based on a program for
Z3 [dMB08], or Z3Py⁴, respectively, as a list of constraints. These constraints encode
the network, the input domain, and the coverage item. When the SMT solver determines
that the constraint system is satisfiable, it returns a model. From this model, we can
extract the corresponding inputs to the network that lead to a satisfiable result.

For each coverage item, we run two experiments, with and without a constrained
input domain. First, we use constraints as shown above that allow the Z3 solver
to only consider in-ODD samples from $\mathcal{I}^4$. Second, we also allow Z3 to generate
arbitrary samples from $\mathbb{R}^4$, i.e., we allow Z3 to also generate out-of-ODD samples.
Naturally, the achievable coverage using only in-ODD samples can be at most as
high as the coverage for the entire $\mathbb{R}^4$.

⁴ https://pypi.org/project/z3-solver.

Algorithm 2 High-Level Algorithm

procedure ComputeSMTCoverage(NN)          ▷ NN is the network to be covered
    B ← getCoverageItems(NN)
    C ← ∅                                 ▷ items already covered
    U ← ∅                                 ▷ uncoverable items
    while B \ (C ∪ U) ≠ ∅ do
        select i ∈ B \ (C ∪ U)
        success, x ← runSMT(NN, i)        ▷ x is an input for NN
        if success then
            J ← callMLP(x)                ▷ compute all possible coverage items for x
            C = C ∪ J
        else
            U = U ∪ {i}
        end if
    end while
end procedure
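
The control flow of Algorithm 2 can be sketched in Python as follows. The solver and the network are passed in as callbacks, so `run_smt` and `call_mlp` below are hypothetical stand-ins for the Z3 query and the network evaluation; the toy lookup tables exist only to make the sketch executable.

```python
from typing import Callable, Hashable, Iterable, List, Optional, Set, Tuple

Item = Hashable
Input = Tuple[float, ...]


def compute_smt_coverage(
    items: Iterable[Item],
    run_smt: Callable[[Item], Optional[Input]],
    call_mlp: Callable[[Input], Set[Item]],
) -> Tuple[Set[Item], Set[Item], List[Input]]:
    """Return covered items C, uncoverable items U, and the generated tests."""
    all_items = set(items)            # B
    covered: Set[Item] = set()        # C
    uncoverable: Set[Item] = set()    # U
    tests: List[Input] = []
    while all_items - covered - uncoverable:
        item = next(iter(all_items - covered - uncoverable))
        x = run_smt(item)             # None stands for "unsatisfiable"
        if x is not None:
            tests.append(x)
            # J contains at least the queried item; adding it explicitly
            # guarantees termination even for an inconsistent callback.
            covered |= call_mlp(x) | {item}
        else:
            uncoverable.add(item)
    return covered, uncoverable, tests


# Toy stand-ins: item "c" has no satisfying input.
solutions = {"a": (1.0,), "b": (2.0,)}
hits = {(1.0,): {"a", "b"}, (2.0,): {"b"}}
covered, uncoverable, tests = compute_smt_coverage(
    ["a", "b", "c"], solutions.get, lambda x: hits[x]
)
print(sorted(covered), sorted(uncoverable), len(tests))
```

The number of generated tests depends on the order in which items are selected, since one input may cover several items at once. This is one reason why the test counts reported for the solver-based approaches are small.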

As a baseline, we compare the coverage achieved by the samples from Z3 with
two sampling-based approaches: (i) We use random sampling from the input domain
(further described below) since it is a commonly applied strategy for data acquisition.
(ii) Due to the compactness of our input space, we can use a grid-based sampling
of the input domain. We include this comparison to check how easy or difficult it is
to achieve maximum coverage using sampling-based approaches. In particular, we
use a grid-based sampling with a fixed number of samples per dimension, and we
sample at random from a distribution. As for the structured generation, we perform
two experiments for each approach: one permitting only in-ODD samples from
$\mathcal{I}^4$, and one also permitting out-of-ODD samples. Of course, since we know the
exact maximum coverage for each of the considered coverage metrics from Z3, we are only
interested in how far below the optimum the sampling-based approaches remain.

For the grid-based approach, we sample from $\mathcal{I}^4$ with a resolution of 100
equidistant samples per dimension for the in-ODD samples. In addition, we sample from
$\mathcal{R}^4 \subset \mathbb{R}^4$, $\mathcal{R} = [-2, 2]$, with a resolution of 100
equidistant samples per dimension for a mixture of in-ODD and out-of-ODD samples. For the
random sampling approach, we sample $10^8$ data points uniformly at random from
$\mathcal{I}^4$ for in-ODD samples. In addition, we sample $10^8$ data points from a
Gaussian distribution with zero mean and unit variance, which results in 78% out-of-ODD
samples. For all data sets, we measure coverage based on PyTorch [PGM+19] as described
above in Algorithm 2 for the SMT-based approach.
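
The sample counts and the out-of-ODD share stated above can be reproduced with a short calculation. The sketch assumes that the equidistant grid includes both interval endpoints, which the text does not specify, and `grid_points` is an illustrative helper name.

```python
import math
from itertools import product
from typing import Iterator, Tuple


def grid_points(low: float, high: float, per_dim: int, dims: int = 4) -> Iterator[Tuple[float, ...]]:
    """Iterate over an equidistant grid with per_dim samples per dimension."""
    step = (high - low) / (per_dim - 1)
    axis = [low + k * step for k in range(per_dim)]
    return product(axis, repeat=dims)


# 100 samples per dimension in four dimensions give 100**4 = 10**8 grid points.
n_grid = 100 ** 4

# A coarse grid (3 per dimension) keeps the example cheap: 3**4 = 81 points.
small = list(grid_points(-1.0, 1.0, per_dim=3))

# For a standard normal pixel, P(|x| <= 1) = erf(1 / sqrt(2)).
# A sample is in-ODD only if all four independent pixels are in range.
p_in_odd = math.erf(1 / math.sqrt(2)) ** 4
p_out_of_odd = 1 - p_in_odd
print(n_grid, len(small), round(p_out_of_odd, 3))
```

The computed out-of-ODD probability of roughly 0.78 agrees with the 78% reported for the Gaussian samples.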
4.3 Experimental Results


Table 3 shows the number of generated samples, whereas Table 4 summarizes the
coverage results from our experiments, both for MLP_adv. Please note that these results
are for a single network. However, we are not interested in the specific numbers 
here but rather compare different coverage models and data generation approaches 
qualitatively on a specific network. The structure of Table 4 is as follows: Each row 
contains the results for one data generation approach. Thus, the first row contains 
the result when optimizing for positive neuron coverage (NC (+)), i.e., getting a 
positive activation for each individual neuron. The second row contains the results 
for neuron coverage in both directions (NC (both)), i.e., each individual neuron has 
at least once a positive and a negative activation. The third row contains the results 
when optimizing PWC. The final two rows contain the results for the two baseline 
sampling approaches. 

The columns of the table show the resulting coverage for the different coverage 
metrics. For each column, we provide separate columns All and ODD: The ODD 
columns contain the results when only in-ODD samples are used, whereas the All 
columns contain the result when also allowing for out-of-ODD samples. For NC (+) 
and NC (both), the cells contain the coverage results for each of the three layers, 
denoted by L1 to L3. Please note that L1 is the layer directly after the input and L3 
constructs the representation, which is used in the final classification layer L4. For 
PWC, we further split the All and ODD columns based on the layer combinations
that are considered. The L1-L2 columns contain the results for the coverage of the
interaction between the neurons on layers L1 and L2 while the L2-L3 columns
contain the results for the interaction between L2 and L3, respectively. In these
columns, the entries in the single cells refer to the coverage of the four possible
combinations of activations for the considered pairs of neurons. As an example, in
column L1-L2, “+-” refers to all cases where a neuron in L1 has a positive activation,
and a neuron in L2 has a negative activation. In each row, we marked the cells with 
gray background where the reported coverage corresponds to the optimization target 
in Z3. These entries signify the maximally possible coverage values. 
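
The coverage metrics described above can be computed from recorded activation patterns as sketched below. The sketch represents each test as one list of booleans per layer (True meaning a positive activation); how exactly the sign of an activation is measured is an assumption left open here, and all function names are illustrative.

```python
from typing import List, Sequence, Set, Tuple

# One test: for each layer, one bool per neuron (True = positive activation).
Pattern = Sequence[Sequence[bool]]


def covered_items(patterns: Sequence[Pattern]):
    """Collect neuron items (layer, neuron, sign) and pair items for adjacent layers."""
    neurons: Set[Tuple[int, int, bool]] = set()
    pairs: Set[Tuple[int, int, bool, int, bool]] = set()
    for pattern in patterns:
        for l, layer in enumerate(pattern):
            for n, s in enumerate(layer):
                neurons.add((l, n, s))
        for l in range(len(pattern) - 1):
            for a, sa in enumerate(pattern[l]):
                for b, sb in enumerate(pattern[l + 1]):
                    pairs.add((l, a, sa, b, sb))
    return neurons, pairs


def nc(neurons, sizes: Sequence[int], both: bool = False) -> List[float]:
    """Per-layer neuron coverage: NC (+) by default, NC (both) if both=True."""
    result = []
    for l, size in enumerate(sizes):
        hit = sum(
            1 for n in range(size)
            if (l, n, True) in neurons and (not both or (l, n, False) in neurons)
        )
        result.append(hit / size)
    return result


def pwc(pairs, sizes: Sequence[int], l: int, sa: bool, sb: bool) -> float:
    """Share of neuron pairs between layers l and l+1 seen with signs (sa, sb)."""
    hit = sum(
        1 for a in range(sizes[l]) for b in range(sizes[l + 1])
        if (l, a, sa, b, sb) in pairs
    )
    return hit / (sizes[l] * sizes[l + 1])


# Toy network with layer sizes 2 and 1, and two tests.
sizes = [2, 1]
tests = [[[True, False], [True]], [[True, True], [False]]]
neurons, pairs = covered_items(tests)
print(nc(neurons, sizes), nc(neurons, sizes, both=True))
print(pwc(pairs, sizes, 0, True, True), pwc(pairs, sizes, 0, False, False))
```

In the toy example, NC (+) is already complete while the “--” combination of PWC is not covered at all, which illustrates why PWC requires more tests than NC.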

In the following, we highlight some interesting aspects of the results. Note that,
in general, the coverage in the All column must be larger than or equal to the coverage
in the ODD column for each of the coverage metrics as $\mathbb{R}^4 \supset \mathcal{I}^4$.

Intuitively, we would expect that the more elaborate a coverage metric is (i.e.,
the more elements it requires to be covered), the more test samples are required to
achieve coverage. In our results, this expectation is satisfied, i.e., we need more tests
for PWC than for NC, and we need more tests for NC (both) than for NC (+) as
shown in Table 3. In addition, we would expect that coverage is proportional to the
number of tests for the first three rows, i.e., more tests imply higher coverage, given
that we only add new tests if they actually increase coverage. Interestingly, this is
not satisfied for PWC, where we generate more tests for in-ODD samples than for
All, even though the resulting coverage for ODD is lower than for All. It seems that
in our particular MLP, it is easier to stimulate multiple neurons at once when using
out-of-ODD samples.

Table 3 Number of tests per approach

     | NC (+) | NC (both) | PWC | Grid   | Sample
All  | 6      | 10        | 39  | $10^8$ | $10^8$
ODD  | 5      | 10        | 47  | $10^8$ | $10^8$

Table 4 Results from the coverage analysis, L$\ell$ means layer $\ell$. The rows show different data
generation approaches (solver-based and sampling-based). The columns show coverage based on
different coverage metrics. Grey shaded cells show the results of the metric that was optimized in SMT

Neuron coverage:

Approach  | Layer | NC (+) All | NC (+) ODD | NC (-) All | NC (-) ODD
NC (+)    | L1    | 1.00       | 1.00       | 1.00       | 0.50
          | L2    | 1.00       | 0.80       | 0.90       | 0.60
          | L3    | 1.00       | 0.83       | 1.00       | 0.50
NC (both) | L1    | 1.00       | 1.00       | 1.00       | 1.00
          | L2    | 1.00       | 0.80       | 1.00       | 1.00
          | L3    | 1.00       | 0.83       | 1.00       | 1.00
PWC       | L1    | 1.00       | 1.00       | 1.00       | 1.00
          | L2    | 1.00       | 0.80       | 1.00       | 1.00
          | L3    | 1.00       | 0.83       | 1.00       | 1.00
Grid      | L1    | 1.00       | 1.00       | 1.00       | 1.00
          | L2    | 1.00       | 0.80       | 1.00       | 1.00
          | L3    | 0.83       | 0.83       | 1.00       | 1.00
Sample    | L1    | 1.00       | 1.00       | 1.00       | 1.00
          | L2    | 1.00       | 0.80       | 1.00       | 1.00
          | L3    | 0.83       | 0.83       | 1.00       | 1.00

Pairwise coverage (PWC):

Approach  | Pair | All L1-L2 | All L2-L3 | ODD L1-L2 | ODD L2-L3
NC (+)    | ++   | 0.85      | 0.82      | 0.67      | 0.57
          | +-   | 0.81      | 0.72      | 0.56      | 0.37
          | -+   | 0.62      | 0.83      | 0.34      | 0.47
          | --   | 0.69      | 0.73      | 0.26      | 0.27
NC (both) | ++   | 0.89      | 0.92      | 0.78      | 0.67
          | +-   | 0.84      | 0.80      | 0.81      | 0.72
          | -+   | 0.79      | 0.78      | 0.65      | 0.82
          | --   | 0.77      | 0.77      | 0.68      | 0.53
PWC       | ++   | 0.98      | 0.95      | 0.80      | 0.67
          | +-   | 1.00      | 0.95      | 0.94      | 0.77
          | -+   | 0.97      | 0.98      | 0.78      | 0.82
          | --   | 1.00      | 0.97      | 0.88      | 0.77
Grid      | ++   | 0.97      | 0.82      | 0.79      | 0.67
          | +-   | 0.99      | 0.92      | 0.93      | 0.73
          | -+   | 0.96      | 0.82      | 0.76      | 0.82
          | --   | 0.95      | 0.88      | 0.88      | 0.75
Sample    | ++   | 0.98      | 0.82      | 0.80      | 0.67
          | +-   | 1.00      | 0.92      | 0.94      | 0.75
          | -+   | 0.96      | 0.83      | 0.78      | 0.82
          | --   | 0.99      | 0.93      | 0.88      | 0.77

Let us now consider NC. When using arbitrary test samples, it is possible to
achieve 100% coverage for both NC (+) and NC (-). However, when restricting the
generation to only in-ODD samples, i.e., a subset of the complete input space of the
MLP, full coverage can no longer be reached for NC (+). In particular, we observe
that some neurons on layers L2 and L3 can no longer be stimulated to have a positive
activation. Considering the use of SMT leading to this observation, we thus not
only have empirical evidence but also a tool-based mathematical proof that 100%
coverage is not possible in this case.⁵

Comparing the coverage in the NC (+) column with the corresponding coverage
for the two sampling-based approaches, we notice that they achieve the
same non-100% coverage for in-ODD samples and a smaller non-100% coverage for
All samples. Thus, the sampling-based approaches failed to achieve 100% coverage
for NC (+) in both cases. While we know in this example that no better result was 
theoretically possible for in-ODD samples, more coverage could have been possible 
for All. However, for a more complex task, these two cases cannot be distinguished, 
i.e., if we have coverage less than 100%, we cannot decide whether it is impossible to 
achieve more or whether our test set is yet incomplete. In addition, the result shows 
that even in this simple domain, a relatively high number of samples do not achieve 
maximal coverage. Thus, it is highly unlikely to achieve the theoretically possible 
coverage with respect to more involved metrics such as PWC on complex domains 
in computer vision using random or grid-based sampling. 

As stated above, we can achieve higher coverage when also allowing for out-of- 
ODD samples. In our example, the network was not trained on such out-of-ODD 
samples, so the question remains whether it is actually meaningful to include such 
samples in a test set just for increasing coverage. In our example, we can give some 
intuition on this. For this, we look in detail at samples generated by our SMT-based 
approach with and without input constraints. The results are shown in Fig. 5, where
we plot samples generated by SMT for the NC (both) criterion. Each run creates 10
samples, as is shown in Table 3. Each of these samples is labeled according to our
labeling function, and the corresponding cross-entropy loss $J^{\mathrm{CE}}$ is computed.
Since the loss varies drastically, we show a logarithmic plot, and since the loss can be
zero for perfect predictions, we specifically plot


$\log(J^{\mathrm{CE}} + 1)$. (1)
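
The effect of the transformation in (1) can be seen in a short sketch. The probability vectors below are made-up illustrations, not values from the experiments, and the function names are chosen for this example.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross-entropy loss of one sample given the predicted class probabilities."""
    return 0.0 - math.log(probs[label])


def log_modified_loss(probs: Sequence[float], label: int) -> float:
    """log(J + 1), which stays finite (zero) when the loss J itself is zero."""
    return math.log1p(cross_entropy(probs, label))


perfect = [0.0, 0.0, 1.0, 0.0, 0.0]            # all mass on the true class 2
poor = [0.9, 0.0999, 0.0001, 0.0, 0.0]         # almost no mass on the true class 2
print(log_modified_loss(perfect, 2), log_modified_loss(poor, 2))
```

A plain logarithm of the loss would be undefined for the perfect prediction, whereas the shifted version maps it to zero and keeps the ordering of all other losses.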


Remember from Table 3 that there are 10 samples each for the full input space and
the ODD space. As we can see, for both input domains, for the first three samples, the
network generates perfect predictions and therefore $\log(J^{\mathrm{CE}} + 1) = 0$. For the
ODD samples, we can see that the loss is quite low and at most in the order of the maximum
training loss (as indicated by the dotted line). We also annotate for each sample of
the full input space (All) whether the resulting sample is in the ODD, labeled as All
(in) and indicated by a dot, or whether the sample is out-of-ODD, labeled as All (out)
and indicated by a diamond shape. As we can see, when optimizing coverage without
constraints (All), most of the samples (7/10) produced by Z3 are out-of-ODD, even
though in-ODD samples would also exist for the most part. In the figure, we can
see that these out-of-ODD samples are not suitable for a realistic evaluation: Each
out-of-ODD sample has a very high loss, at least an order of magnitude higher than
the maximum training loss, marked by a dotted line. It can also be observed that
in-ODD samples have a significantly lower loss. Nevertheless, even here, we see that
the SMT-based approach generates two very difficult examples.

⁵ Please note that the large-scale experiments with random and grid-based sampling provide
additional empirical evidence.

Fig. 5 Log-modified cross-entropy losses (see (1)) on samples generated by SMT for NC (both).
ODD refers to in-ODD samples, All (in/out) refer to in-ODD/out-of-ODD samples sampled from
a mixed in-ODD/out-of-ODD input space, and Max train is the maximum training loss

Intuitively, we would expect that achieving a high coverage is more difficult 
on deeper layers because achieving a specific activation also requires establishing 
necessary activations on the previous layers. This means these previous layers work 
like constraints for the computation of an input that creates an activation on a deeper 
layer. Thus, the deeper the layer, the more constraints need to be satisfied. While our
results confirm this intuition in most cases, the effect is not as strong as one might
expect, e.g., when comparing the coverage for PWC. Here, the coverage for L2-L3 is only
slightly lower in three out of four cases than for L1-L2, and in the fourth case, it is
even higher for the deeper layers. This is most probably the case because the network
is rather shallow.

Note that we additionally performed these experiments on MLP_plain. While the
results are largely the same in direction (reductions), the magnitude of differences is
slightly larger. We hypothesize that this is due to the lower robustness of the
representations and, thus, preferred to show the results for the adversarially trained MLP.
4.4 Discussion


Our results for the formal analysis of achievable coverage clearly show that even 
for simple networks, 100% coverage may not be achievable. This has two reasons: 
(i) Networks are trained in a way that they contain some activations that may only 
ever be used in one direction. We can see this in the All columns of Table 4 that
have coverage levels < 100%. In traditional software, we would refer to this as 
“dead code” that results in non-perfect coverage. One could argue that as with “dead 
code” in traditional software, there may be a way to remove this by some form of 
static analysis. (ii) Even if we remove this “static dead code”, we can still see a 
difference between All and ODD. This difference is harder to identify in practice 
as—in contrast to our simple example—this distinction between All and ODD is not 
always obvious for computer vision tasks. Hence, this is a harder form of “dead code” 
depending on the particular input/operational distribution. Both of these sources 
result in incomplete coverage. 

Let us summarize major differences between state-of-the-art DNNs used, e.g., in 
computer vision for automated driving functions, and our presented MLP example: 


• In a complex input domain, we do not have a clear definition of ODD and All,
  and we may not be able to decide whether a specific input is reasonable or not.
  For example, it is not clear whether a new image lies in the challenging part of the
  ODD or is considered to lie outside the ODD.

• We usually want to focus on in-ODD examples.

• Formal approaches such as the one used in these examples do not scale to
  typical state-of-the-art DNNs. Note that we cannot directly use more efficient,
  approximate works, e.g., on adversarial examples, since they perform an
  over-approximation while we are interested in under-approximations for coverage.


As a result, the identification of both sources of “dead code” is typically infeasible. 
This means that we do not know the real upper threshold for achievable coverage 
and must over-approximate it with the complete number of coverage items, including 
both sources of “dead code”. 


5 Conclusion 


In summary, while neuron coverage remains promising in certain aspects for V&V, 
it comes with many caveats. The granularity of the metric in use determines how 
effective the metric is at uncovering faults. The granularity of application of the 
metric, layerwise instead of full model, also reveals further interesting behavior of 
the model under test. Similarly, when studying the novelty of input samples, we 
saw that random inputs show more novelty and lead to more coverage due to their 
unexpected nature. Structured input with a trained network has lower novelty in 
comparison. While test generation using NC focuses on generating realistic novel 
samples, it is unclear if the increase in coverage from these generated samples is
due to some random noise injected into the generated image or meaningful semantic 
change. 

Coverage metrics are also hard to interpret if the metric converges to a value 
below 100%. This means if we determine that a certain neuron is not coverable, we 
cannot determine (i) whether it is coverable with some input, and moreover (ii) with 
some input inside our ODD. Note that, depending on the ODD, using an adversarial
example in some $\epsilon$-environment may not be reasonable for the ODD. Consequently,
the coverage metrics are not actionable in the sense that we can derive information 
on which particular inputs we lack in our test set. 

As we have shown, filling the test set with out-of-ODD samples for increasing 
coverage has a questionable effect. Since they are out-of-ODD, the network under 
test will usually not be trained on such samples. As a result, the network needs 
to extrapolate for classifying them and yields a high loss in our samples in Fig. 5. 
However, in real tasks, we do not necessarily know whether the newly generated data 
is out of the ODD, and the high loss may indicate that such samples should further 
be considered, e.g., added to the training set. 
Abstract Modern deep neural networks (DNNs) are achieving state-of-the-art
results due to their capability to learn a faithful representation of the data they are 
trained on. In this chapter, we address two insufficiencies of DNNs, namely, the 
lack of robustness to corruptions in the data, and the lack of real-time deployment 
capabilities, that need to be addressed to enable their safe and efficient deployment 
in real-time environments. We introduce hybrid corruption-robustness focused com- 
pression (HCRC), an approach that jointly optimizes a neural network for achieving 
network compression along with improvement in corruption robustness, such as 
noise and blurring artifacts that are commonly observed. For this study, we primarily 
consider the task of semantic segmentation for automated driving and focus on the 
interactions between robustness and compression of the network. HCRC improves 
the robustness of the DeepLabv3+ network by 8.39% absolute mean performance 
under corruption (mPC) on the Cityscapes dataset, and by 2.93% absolute mPC 
on the Sim KI-A dataset, while generalizing even to augmentations not seen by 
the network in the training process. This is achieved with only minor degradations 
on undisturbed data. Our approach is evaluated over two strong compression ratios 
(30% and 50%) and consistently outperforms all considered baseline approaches. 
Additionally, we perform extensive ablation studies to further leverage and extend 
existing state-of-the-art methods. 
1 Introduction


Motivation: Image classification [KBK20], object detection [YXC20], machine
translation [SVS19], and reading comprehension [ZHZ11] are just some of the tasks
at which deep neural networks (DNNs) excel. They have proven to be an effective way
to extract information from enormous amounts of data, and they are only expected
to become more advanced over time. Despite their rapid progress, two insufficiencies
of DNNs need to be addressed before deployment in real-time systems. First,
in real-world applications, the edge devices on which these networks are deployed
have limited memory and limited computational throughput (operations per second),
both of which are required for neural network deployment.
Second, DNNs are not robust to even slight changes in the input
(such as noise and weather conditions), which makes deployment in safety-critical
applications challenging (Fig. 1).


Lack of DNN efficiency: To overcome the lack of efficiency, techniques such as
pruning [HZS17, MTK+17, TKTH18], quantization [CLW+16, JKC+18, JGWD19]
including quantization-aware training [GAGN15], knowledge distillation [HVD14],
and encoding techniques [HMD16] are commonly used. All of these strategies seek
to take advantage of the available redundancy in large DNNs to achieve run-time
speedup.


Lack of DNN robustness: In addition to this insufficiency, recent studies [AMT18,
HD19, BHSFs19] show that DNNs are not robust to even slight changes to the input
image. These changes vary from carefully crafted perturbations called adversarial
attacks [XLZ+18, DFY+20, BKV+20], to real-world augmentations such as snow,
fog, additive noise, etc. [HD19]. The changes in the input image could vary from
changes in just a few pixels [SVS19] to more global changes such as contrast and
brightness [ZS18]. In the real world, such local or global changes are to be expected.
For example, varying lighting conditions or foggy weather conditions can cause
changes in the brightness and contrast of the input image.


[Figure: Compression Paradigm (left) and Robustness Paradigm (right) around an Automated
Driving Function; each paradigm is rated on “robust to changes in input” and
“real-time capable on target hardware”]
Fig. 1 Automated driving functions are subject to the two possibly opposing goals of
compression and robustness. Approaches in the compression paradigm (left) are focused on
enabling real-time efficient networks and rarely consider the effect of such a compression
on the robustness properties of the network. Similarly, the robustness paradigm (right)
typically does not consider real-time properties of the network

In this chapter, we tackle the insufficiencies that were mentioned above and introduce
hybrid corruption-robustness focused compression (HCRC), an approach to
jointly optimize a neural network for achieving network compression along with
improvement in corruption robustness. By corruption, we refer to real-world
augmentations, such as noise, blur, weather conditions, and digital effects, which
commonly occur in the real world and are therefore of significance. Our major
contributions in this chapter are described below.

First, HCRC focuses on real-world corruption robustness and proposes a hybrid 
compression strategy, combining pruning and quantization approaches. We obtain a 
more robust and compressed network and also perform comparisons with sequential 
application of robustification and compression methods. Second, we approach the 
problem of robustness by training with augmentations in a controlled severity fash- 
ion. With our method, we show a further improvement under corruption (rPC) not 
only to the corruptions used during training but also to unseen corruptions including 
noise and blurring artifacts. Third, since all the methods discussed so far are only
evaluated on small datasets for image classification, such as MNIST [LBBH98],
CIFAR-10 [Kri09], and SVHN [NWC+11], there remains the question of their transferability
to complex tasks, such as semantic segmentation [XWZ+17]. We, for the first time,
perform such a study on two road-scene datasets (Cityscapes [COR+16] and Sim
KI-A) and a state-of-the-art semantic segmentation network, DeepLabv3+ [CPK+18].

This chapter is structured as follows: In Sect. 3, we describe the individual compo- 
nents of such a system and our HCRC methodology in detail. In Sect. 4, we describe 
the corruptions that are used during training and evaluation, the datasets that are 
used, and the metrics used in the experiments. In Sect.5, we present our experimen- 
tal results and observations. Finally, in Sect.6, we conclude our chapter. 


2 Related Works 


Model compression and network robustness have each been used to address one of the
above-mentioned insufficiencies of DNNs individually. Only recently have studies
started to investigate the interaction between the two techniques.

Zhao et al. [ZSMA19] report one of the first investigations to empirically study 
the interactions between adversarial attacks and model compression. The authors 
observe that a smaller word length (in bits) for weights, and especially activa- 
tions, makes it harder to attack the network. Building upon the alternating direction 
method of multipliers (ADMM) framework introduced by Ye et al. [YXL+19], Gui 
et al. [GWY+19] evaluated the variation of adversarial robustness (FGSM [GSS15]
and also PGD [MMS+18]) with a combination of various compression techniques 
such as pruning, factorization, and quantization. In summary, so far we observe that 
network compression affects adversarial robustness, and a certain trade-off exists 
between them. The extent of the trade-off and the working mechanism behind it 
remains unsolved [WLX+20]. 

Some works have used compression techniques such as pruning and quantiza- 
tion that were traditionally used to obtain network compression, to improve the 
robustness of networks. For example, Lin et al. [LGH19] use quantization not for
acceleration of DNNs, but to control the error propagation phenomenon of adver- 
sarial attacks by quantizing the filter activations in each layer. On a similar line, 
Sehwag et al. [SWMJ20] propose to select the filters to be pruned by formulating 
an empirical minimization problem by incorporating adversarial training (using the 
PGD attack [MMS+18]) in each pruning step. Very recently, in addition to proposing 
the new evaluation criterion AER (that stands for accuracy, efficiency, and robustness) 
for evaluating the robustness and compressibility of networks, Xie et al. [XQXL20] 
describe a blind adversarial pruning strategy that combines adversarial training along 
with weight pruning. 

In this chapter, we focus on real-world corruption robustness as opposed to the 
robustness to adversarial attacks. Additionally, we focus on the study of the inter- 
actions between robustness, quantization, and pruning methods within our proposed 
approach, supported by ablation studies. 


3 HCRC: A Systematic Approach 


Our goal is to improve the robustness to common image corruptions and at the same
time reduce the memory footprint of semantic segmentation networks in a systematic 
way. In this section, we describe our systematic hybrid corruption-robustness focused 
compression (HCRC) approach to achieve compressed models that are also robust 
to commonly occurring image corruptions. Our proposed system can be broadly 
divided into two objectives: the robustness objective and the compression objective. 
Both will be described in the following subsections. 


3.1 Preliminaries on Semantic Segmentation 


We define $x \in \mathbb{I}^{H \times W \times C}$ to be a clean image of the dataset $\mathcal{X}$, with the image height $H$,
image width $W$, $C = 3$ color channels, and $\mathbb{I} = [0, 1]$. The image $x$ is an input to a
semantic segmentation network $F(x, \theta)$ with network parameters $\theta$. Further, we refer
to a network layer using an index $\ell \in \mathcal{L} = \{1, \dots, L\}$, with $\mathcal{L}$ being the set of layer
indices. Within layer $\ell$ we can define $\theta_{\ell,k} \in \mathbb{R}^{H_\ell \times W_\ell}$ to be the $k$th kernel, where
$k \in \mathcal{K}_\ell = \{1, \dots, K_\ell\}$, with the set of kernel indices $\mathcal{K}_\ell$ of layer $\ell$. The image input
$x$ is transformed to class scores by

$$y = F(x, \theta) \in \mathbb{I}^{H \times W \times S}. \tag{1}$$

Each element in $y = (y_{i,s})$ is a posterior probability $y_{i,s}(x)$ for the class $s \in \mathcal{S} =
\{1, 2, \dots, S\}$ at the pixel position $i \in \mathcal{I} = \{1, \dots, H \cdot W\}$ of the input image $x$, with
$S$ denoting the number of semantic classes. A segmentation mask $m = (m_i) \in \mathcal{S}^{H \times W}$
can be obtained from these posterior probabilities with elements

$$m_i = \operatorname*{argmax}_{s \in \mathcal{S}} \; y_{i,s}, \tag{2}$$

by assigning a class to each pixel $i$. The accuracy of the prediction is evaluated by
comparing this obtained segmentation mask $m$ against the labeled (ground truth)
segmentation mask $\overline{m} \in \mathcal{M}$, which has the same dimensions as the segmentation mask $m$.
Likewise, $\overline{y} \in \{0, 1\}^{H \times W \times S}$ is the one-hot encoded ground truth in three-dimensional
tensor format that can be retrieved from $\overline{m}$.
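
As a small illustration of (2), the following Python sketch derives a segmentation mask from per-pixel posteriors and builds the corresponding one-hot ground truth. It assumes the image is flattened into a list of H·W pixels, each holding S class probabilities; the function names `segmentation_mask` and `one_hot` are illustrative and do not come from the chapter.

```python
def segmentation_mask(y):
    """Eq. (2): pick the most probable class for every pixel.

    y is a list of per-pixel posteriors with y[i][s - 1] = y_{i,s}.
    Classes are numbered 1..S as in the text; ties go to the lowest class index.
    """
    return [max(range(len(p)), key=p.__getitem__) + 1 for p in y]


def one_hot(mask, num_classes):
    """One-hot encode a ground-truth mask whose labels lie in 1..num_classes."""
    return [[1 if s == label else 0 for s in range(1, num_classes + 1)]
            for label in mask]


# Two pixels, three classes (values are made up for illustration).
y = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
m = segmentation_mask(y)
y_bar = one_hot([1, 3], num_classes=3)
```

In a real network the posteriors would be a tensor of shape H × W × S and the argmax would be taken along the class axis; the flattened-list layout here only keeps the sketch dependency-free.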


3.2 Robustness Objective 


Data augmentation: In Fig. 2, the green data augmentation block on the left depicts
the image pre-processing method following Hendrycks et al. [HMC+20]. Here, the
input image is augmented by mixing randomly sampled corruptions. The key idea is
to introduce some amount of randomness in both the type and the superposition of
image corruptions. To achieve this, the input image is first split into three parts and
then passed as an input to the data augmenter sub-blocks. Within a data augmenter
sub-block, initially, a uniformly sampled corruption $A_n \in \mathcal{A}^{\mathrm{train}}$ is applied to the
input. Here, $\mathcal{A}^{\mathrm{train}} = \{A_1, A_2, \dots, A_N\}$ denotes a set of $N$ pre-defined corruption
functions $A_n(\cdot)$ that are used during training. The corresponding corrupted image is
computed as

$$\tilde{x} = A_n(x, \Upsilon) \in \mathbb{I}^{H \times W \times C}, \tag{3}$$

where $A_n(x, \Upsilon)$ is the image corruption function and $\Upsilon$ is a parameter controlling
the strength of the applied augmentation. This random sampling and augmentation
operation is repeated consecutively $R = 4$ times within each of the data augmenter
sub-blocks.


[Fig. 2 diagram. Left panel: Data Augmentation (image data x; "Sample and Apply" sub-blocks drawing from a database of pre-defined corruptions). Right panel: Computation of Losses (Jensen-Shannon divergence loss computed from the DNN outputs).]


Fig. 2 Overview of the data augmentation strategy (left) and loss construction (right) for a semantic
segmentation DNN

The output of each of the $N$ data augmenter sub-blocks is, therefore, an augmented
image

$$\tilde{x}_n = \mathop{\bigcirc}_{r=1}^{R} A^{(r)}(x, \Upsilon), \quad n \in \{1, 2, \dots, N\}, \tag{4}$$

that is a combination of $R$ applications of corruptions from $\mathcal{A}^{\mathrm{train}}$. Choosing $N = 3$,
these outputs are first passed to multipliers with weights $w_1$, $w_2$, and $w_3$, which
are sampled from a Dirichlet distribution $P^{\mathrm{Dirichlet}}$, and then added. Thereafter, the
added output is multiplied by a factor $\gamma$ that is sampled from a beta distribution $P^{\mathrm{Beta}}$
with parameters $\alpha = 1$ and $\beta = 1$. Further, this is added to the input image $x$, which
is multiplied by a factor $1 - \gamma$, to obtain the augmented image $\tilde{x}^{(b)}$, with $b \in \mathcal{B} =
\{1, 2, \dots, B\}$ denoting the index among the $B$ final augmented images being used
in our proposed training method. Note that $B + 1$ is our minibatch size, where one
original image and $B$ augmented images are being employed.
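
The mixing procedure described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three corruption functions are hypothetical stand-ins for the corruptions in $\mathcal{A}^{\mathrm{train}}$, the image is a flat list of intensities in [0, 1], and the Dirichlet parameters are assumed to be all ones (the text does not state them).

```python
import random


def posterize(x, strength):
    """Hypothetical stand-in corruption: reduce the number of gray levels."""
    levels = max(2, int(8 - 6 * strength))
    return [round(v * (levels - 1)) / (levels - 1) for v in x]


def darken(x, strength):
    """Hypothetical stand-in corruption: scale intensities down."""
    return [v * (1.0 - 0.5 * strength) for v in x]


def invert_blend(x, strength):
    """Hypothetical stand-in corruption: blend the image with its negative."""
    return [(1.0 - strength) * v + strength * (1.0 - v) for v in x]


def augment(x, ops, rng, n_chains=3, depth=4, strength=0.3, alpha=1.0, beta=1.0):
    """Mix n_chains corrupted copies of x and blend the result with x."""
    chains = []
    for _ in range(n_chains):
        out = list(x)
        for _ in range(depth):  # R consecutive, uniformly sampled corruptions
            out = rng.choice(ops)(out, strength)
        chains.append(out)

    # Dirichlet(1, ..., 1) weights via normalized gamma draws (assumed parameters).
    g = [rng.gammavariate(1.0, 1.0) for _ in range(n_chains)]
    w = [gi / sum(g) for gi in g]
    mixed = [sum(w[n] * chains[n][p] for n in range(n_chains))
             for p in range(len(x))]

    gamma = rng.betavariate(alpha, beta)  # blending factor between x and the mix
    return [(1.0 - gamma) * xp + gamma * mp for xp, mp in zip(x, mixed)]


rng = random.Random(0)
x = [0.0, 0.25, 0.5, 0.75, 1.0]  # a tiny flattened "image" in [0, 1]
x_aug = augment(x, [posterize, darken, invert_blend], rng)
```

Because every step is a convex combination of images in [0, 1], the augmented image stays in the valid intensity range without clipping, provided each corruption function does.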


Construction of losses: In Fig. 2, the gray block on the right shows the strategy to
construct losses from the predictions of the semantic segmentation network. Following
(1), $\tilde{y}^{(b)}$ denotes the class scores for an augmented input image $\tilde{x}^{(b)}$. In addition
to the aforementioned data augmentation strategy in the pre-processing stage, a loss
function with an auxiliary loss term is introduced to enforce regularization between
the responses of the semantic segmentation network to clean and augmented images
in the training stage. The total loss is defined as

$$J = J^{\mathrm{CE}} + \lambda J^{\mathrm{JSD}}, \tag{5}$$

where $J^{\mathrm{CE}}$ is the cross-entropy loss and $J^{\mathrm{JSD}}$ is the auxiliary loss, also called the
Jensen-Shannon divergence (JSD) loss [MS99]. The term $\lambda$ is a hyper-parameter
introduced to adjust the influence of $J^{\mathrm{JSD}}$ on the total loss $J$. The cross-entropy
loss $J^{\mathrm{CE}}$ is computed between the posterior probabilities $y$ of the network conditioned
on input $x$ and its corresponding labels $\overline{y}$. It is defined as


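$$J^{\mathrm{CE}} = -\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \sum_{s \in \mathcal{S}} \omega_s \, \overline{y}_{i,s} \log(y_{i,s}), \tag{6}$$

by taking a mean over all pixels for the posterior probability $y$, where $\omega_s$ are the
weights assigned to each class during training, following [WSC+20]. The auxiliary
loss, or the Jensen-Shannon divergence (JSD) loss, is defined as

$$J^{\mathrm{JSD}} = \frac{1}{B + 1} \Big( \mathrm{KL}(y, \hat{y}) + \sum_{b \in \mathcal{B}} \mathrm{KL}(\tilde{y}^{(b)}, \hat{y}) \Big). \tag{7}$$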


It is computed between the mixture $\hat{y}$ and the posterior probabilities $y$ or $\tilde{y}^{(b)}$, where
$b \in \mathcal{B} = \{1, \dots, B\}$. Note that $\hat{y} = \frac{1}{B + 1} \big( y + \sum_{b \in \mathcal{B}} \tilde{y}^{(b)} \big)$ is the mixture of the
probabilities, and $\tilde{y}^{(b)} = F(\tilde{x}^{(b)}, \theta)$. The auxiliary JSD loss is introduced to reduce the
variation in the probability distributions of the predictions between a clean input and
an augmented input. To do this, two Kullback-Leibler (KL) divergence terms are
introduced in (7), e.g.,

$$\mathrm{KL}(\tilde{y}^{(b)}, \hat{y}) = \sum_{i \in \mathcal{I}} \tilde{y}_i^{(b)} \log\left( \frac{\tilde{y}_i^{(b)}}{\hat{y}_i} \right), \tag{8}$$

defining a distribution-wise measure of how one probability distribution (here: $\tilde{y}_i^{(b)}$)
differs from the reference mixture distribution (here: $\hat{y}_i$).
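
The loss construction in (5) to (8) can be made concrete with a short, dependency-free Python sketch. It is an illustration under assumptions rather than the chapter's training code: predictions are lists of per-pixel class posteriors, the KL divergence is summed over both pixels and classes, and the small constant `eps` is a numerical guard that is not part of (6).

```python
import math


def cross_entropy(y, y_bar, class_weights):
    """Eq. (6): class-weighted cross-entropy, averaged over pixels."""
    eps = 1e-12  # numerical guard against log(0); an assumption of this sketch
    total = 0.0
    for p, t in zip(y, y_bar):
        total -= sum(w * ts * math.log(ps + eps)
                     for w, ts, ps in zip(class_weights, t, p))
    return total / len(y)


def kl(p, q):
    """Eq. (8), summed over pixels and classes; terms with p = 0 contribute 0."""
    return sum(ps * math.log(ps / qs)
               for pp, qp in zip(p, q)
               for ps, qs in zip(pp, qp) if ps > 0.0)


def jsd_loss(y, y_augs):
    """Eq. (7): average KL divergence of each prediction to the mixture."""
    preds = [y] + list(y_augs)
    n = len(preds)  # n = B + 1
    mixture = [[sum(pr[i][s] for pr in preds) / n for s in range(len(y[0]))]
               for i in range(len(y))]
    return sum(kl(pr, mixture) for pr in preds) / n


def total_loss(y, y_augs, y_bar, class_weights, lam):
    """Eq. (5): J = J_CE + lambda * J_JSD."""
    return cross_entropy(y, y_bar, class_weights) + lam * jsd_loss(y, y_augs)
```

When the clean and augmented predictions coincide, the mixture equals each of them and the JSD term vanishes, so the auxiliary loss only penalizes disagreement between the responses to clean and augmented inputs.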


3.3 Compression Objective 


Network pruning: We define a neural network as a particular parameterization of an
architecture, i.e., $F(x, \theta)$ for specific parameters $\theta$. Neural network pruning entails
taking as input a model $F(x, \theta)$ and producing a new network $F(x, M \odot \hat{\theta})$. Here, $\hat{\theta}$
is the set of parameter values that may be different from $\theta$, but both sets are of the
same size $|\theta| = |\hat{\theta}|$, and $M \in \{0, 1\}^{|\hat{\theta}|}$ is a binary mask that forces certain parameters
to be 0, while $\odot$ is the element-wise product operator. In practice, rather than using
an explicit mask, pruned parameters of $\theta$ are fixed to zero and are removed entirely.

We focus on producing a pruned network $F(x, M \odot \hat{\theta})$ from a network $F(x, \theta_0)$,
where $\theta_0$ is either sampled from an initialization distribution or retrieved from a
network pretrained on a particular task. Most neural network pruning strategies build
upon [HPTD15], where each parameter or structural element in the network is issued
a score, and the network is pruned based on these scores. Afterward, as pruning
reduces the accuracy of the network, it is trained further (known as fine-tuning) to
recover this lost accuracy. The process of pruning and fine-tuning is either iterated
several times (iterative pruning) or performed only once (one-shot pruning).

In this chapter, we adopt the magnitude-based pruning approach [HPTD15] that
is described in Algorithm 1. Although there exists a large body of more sophisticated
scoring algorithms, the gain with such algorithms is marginal, if it exists at
all [MBKR18]. Based on the number of fine-tuning iterations $F^{\mathrm{iter}}$, the number of
filter weights to be pruned $F^{\mathrm{pruned}}$ (see Algorithm 1), the total number of prunable
filter weights $F^{\mathrm{total}}$, and the type of pruning (see Algorithm 1, IterativePruning), the
function returns a sparser network $\theta^{\mathrm{pruned}}$ and the binary weight mask $M$.

Algorithm 1 Magnitude-Based Filter Pruning and Fine-Tuning (Iterative Pruning)

Input: $F^{\mathrm{iter}}$, the number of iterations of fine-tuning,
       $\mathcal{X}^{\mathrm{train}}$, the dataset to train and fine-tune,
       $F^{\mathrm{total}}$, the total number of prunable filter weights,
       $F^{\mathrm{pruned}}$, the number of filter weights to be pruned, and
       IterativePruning, a boolean. If true: iterative pruning, if false: one-shot pruning.
$\theta \leftarrow$ initialize()    ▷ Random/ImageNet pretrained weights initialization
$\theta \leftarrow$ trainToConvergence($F(x, \theta)$)    ▷ Standard network training
$M \leftarrow$ rank($\theta$, $F^{\mathrm{pruned}}$)    ▷ Filter weight rank computation (one-shot pruning)
for $i$ in 1 to $F^{\mathrm{iter}}$ do
    if IterativePruning then
        $F^{\mathrm{current}} \leftarrow (1 - F^{\mathrm{pruned}} / F^{\mathrm{total}})^{i / F^{\mathrm{iter}}}$    ▷ Adapt rule of pruning to iterative pruning
        $M \leftarrow$ rank($\theta^{\mathrm{pruned}}$, $F^{\mathrm{current}}$)    ▷ Updating $M$
    end if
    $\theta^{\mathrm{pruned}} \leftarrow$ fineTune($F(x, M \odot \theta)$)    ▷ Network fine-tuning with sparsified weights
end for
Output: $\theta^{\mathrm{pruned}}$, the pruned network,
        $M$, the binary weight mask vector
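
The control flow of Algorithm 1 can be sketched in Python as shown below. This is a simplified illustration under several assumptions: weights are a flat list of individual parameters rather than filters, the `fine_tune` callback is a hypothetical placeholder for real network training (by default it returns the masked weights unchanged), and the iterative schedule grows the number of pruned weights linearly, which is a stand-in for the schedule used in Algorithm 1.

```python
def rank_mask(theta, n_prune):
    """Binary mask that zeroes the n_prune weights of smallest magnitude."""
    order = sorted(range(len(theta)), key=lambda j: abs(theta[j]))
    mask = [1] * len(theta)
    for j in order[:n_prune]:
        mask[j] = 0
    return mask


def prune_and_finetune(theta, n_prune, n_iters, iterative,
                       fine_tune=lambda weights, mask: weights):
    """One-shot or iterative magnitude pruning with interleaved fine-tuning."""
    mask = rank_mask(theta, n_prune)  # one-shot mask, used if iterative is False
    pruned = list(theta)
    for i in range(1, n_iters + 1):
        if iterative:
            # Assumed linear schedule: prune a growing share of the target count.
            n_current = round(n_prune * i / n_iters)
            mask = rank_mask(pruned, n_current)
        masked = [t * m for t, m in zip(pruned, mask)]
        pruned = fine_tune(masked, mask)  # placeholder for actual training
    return pruned, mask


theta = [0.5, -0.1, 0.3, -0.9, 0.05, 0.2]
one_shot, mask_one_shot = prune_and_finetune(theta, 3, n_iters=3, iterative=False)
stepwise, mask_stepwise = prune_and_finetune(theta, 3, n_iters=3, iterative=True)
```

With the identity `fine_tune`, both variants end at the same mask; they differ in practice because real fine-tuning between pruning steps changes the remaining weights and therefore the ranking.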


Quantization: Low-precision fixed-point representations replace floating-point
number representations in fixed-point scalar quantization methods. Fixed-point
scalar quantization operates on single weights $\theta_{\ell,k,j}$ of the network parameters,
where the floating-point format weights are generally replaced by $Q$-bit fixed-point
words [GAGN15], with the extreme case of binarization ($Q = 1$) [CBD15]. We focus
on this uniform rounding scheme instead of other non-uniform schemes because it
allows for fixed-point arithmetic with implementations in PyTorch. Quantization
of network weights contributes to a large reduction in the model size and gives
possibilities for acceleration on target hardware. In-training quantization (ITQ) refers to
training a network by introducing quantization errors (12) in the network weights,
and post-training quantization (PTQ) refers to quantizing the weights of a network
after the training process by calibrating on the $\mathcal{X}^{\mathrm{train}}$ and/or the $\mathcal{X}^{\mathrm{val}}$ set. Figure 3
gives an overview of the in-training quantization (ITQ) and post-training quantization
(PTQ) methods that are used in this chapter. Here, $\theta$ corresponds to the neural
network which is the input to the quantization methods, and $\theta^{\mathrm{quant}}$ refers to the
fixed-point quantized codewords with lower precision. To do so, in the first block, the
statistics for the scale factor

$$\rho_{\ell,k} = \frac{\max_j \theta_{\ell,k,j} - \min_j \theta_{\ell,k,j}}{2^Q - 1}, \tag{9}$$

which defines the spacing between bins, and the bias

$$\delta_{\ell,k} = \mathrm{round}\left( \frac{-\min_j \theta_{\ell,k,j}}{\rho_{\ell,k}} \right), \tag{10}$$

[Fig. 3 diagram. Top panel: In-Training Quantization (ITQ), with a "Compute Running Statistics" block providing $\rho$, $\delta$. Bottom panel: Post-training Quantization (PTQ), with a statistics block providing $\rho$, $\delta$.]

Fig. 3 Overview of quantization methods used in this work. Top: Within the in-training quantization
(ITQ), for each iteration, based on the computed scale $\rho = (\rho_{\ell,k})$ (9) and bias $\delta = (\delta_{\ell,k})$ (10)
terms, the network parameters $\theta$ are first quantized to $\theta^{\mathrm{quant}}$ (11). Thereafter, each element in $\theta^{\mathrm{quant}}$
is converted back to its floating-point representation based on a LUT (12). Bottom: Within the
post-training quantization (PTQ), similar statistics $(\rho, \delta)$ for a trained network are computed on the
training and validation set and the network is quantized to $\theta^{\mathrm{quant}}$ (11)


by which the codewords are shifted, are computed. Here, $j \in \mathcal{J}_{\ell,k}$, where $\mathcal{J}_{\ell,k}$ is the
set of parameter indices of kernel $k$ in layer $\ell$. Thereafter, for both ITQ and PTQ,
each weight $\theta_{\ell,k,j}$ is mapped to its closest codeword $\theta^{\mathrm{quant}}_{\ell,k,j}$ by quantizing $\theta_{\ell,k,j}$ using

$$\theta^{\mathrm{quant}}_{\ell,k,j} = \min\big( q^{\max}, \max\big( q^{\min}, \mathrm{round}(\theta_{\ell,k,j} / \rho_{\ell,k} + \delta_{\ell,k}) \big) \big). \tag{11}$$

Here, $q^{\min}$ and $q^{\max}$ correspond to the minimum and maximum of the range of
quantization levels depending on the chosen $Q$-bit quantization. For example, for an
8-bit quantization, $q^{\min} = 0$ and $q^{\max} = 255$. For ITQ training, the quantized
parameters $\theta^{\mathrm{quant}}_{\ell,k,j}$ are converted back to floating-point representation $\theta^{\mathrm{ITQ}}_{\ell,k,j}$ based on an
integer-float lookup table (LUT) following

$$\theta^{\mathrm{ITQ}}_{\ell,k,j} = (\theta^{\mathrm{quant}}_{\ell,k,j} - \delta_{\ell,k}) \cdot \rho_{\ell,k}. \tag{12}$$

This means that quantization errors are introduced within the network parameters
$\theta^{\mathrm{ITQ}}$, which are then used within the training process. For PTQ, the quantized
parameters $\theta^{\mathrm{quant}}$ are directly used in the evaluation of the semantic segmentation
network.
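
To make the per-kernel quantization in (9) to (12) tangible, the following Python sketch quantizes the weights of a single kernel and maps them back to floating point. It is a minimal illustration: the function names are hypothetical, the kernel is a flat list of weights, and the guard against constant kernels is an added assumption, since (9) would otherwise yield a zero scale factor.

```python
def quantize_kernel(theta, q_bits=8):
    """Uniform quantization of one kernel's weights, following (9) to (11)."""
    q_min, q_max = 0, 2 ** q_bits - 1
    lo, hi = min(theta), max(theta)
    if hi == lo:
        raise ValueError("constant kernel: the scale factor (9) would be zero")
    rho = (hi - lo) / (2 ** q_bits - 1)   # scale factor, eq. (9)
    delta = round(-lo / rho)              # bias, eq. (10)
    codes = [min(q_max, max(q_min, round(t / rho + delta)))  # eq. (11)
             for t in theta]
    return codes, rho, delta


def dequantize_kernel(codes, rho, delta):
    """Map integer codewords back to floating point, eq. (12)."""
    return [(c - delta) * rho for c in codes]


theta = [-0.4, -0.1, 0.0, 0.25, 0.6]
codes, rho, delta = quantize_kernel(theta)
theta_itq = dequantize_kernel(codes, rho, delta)
```

The difference between `theta` and `theta_itq` is the quantization error that ITQ exposes the network to during training; for unclipped weights it is bounded by half the bin spacing $\rho$.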

In this chapter, we focus on the uniform rounding scheme instead of other
non-uniform schemes, because it allows for fixed-point arithmetic with implementations
in PyTorch. Throughout this chapter, we use a strong quantization of Q = 8 bits to
enable higher acceleration on edge devices.

Fig. 4 Overview of our training strategy to co-optimize for corruption robustness along with
network compression by the use of augmentation training, weight pruning, and quantization methods


3.4 HCRC Core Method 


Within the HCRC framework, we systematically combine the robustness 
(Sect. 3.2), pruning, and quantization (Sect. 3.3) methods to co-optimize both robust- 
ness and compression objectives. Figure 4 gives an overview of our training strategy. 
The green block on the top depicts the augmentation strategy of the input image data 
and the consequent construction of losses. For each input image $x$, the augmented
images $\tilde{x}^{(b)}$ are initially computed and passed to the semantic segmentation network.
The total loss (5) is then computed based on the clean and corrupted image predictions.
The orange block on the bottom depicts the quantization of the network
weights and activations and the blue block contains the pruning module. We start by 
initializing the network parameters. In each training iteration, the scale factor (9) and 
the bias (10) are computed, and the network parameters are quantized (11). Addi- 
tionally, in each training epoch, we use iterative pruning which continually prunes a 
certain percentage of the weights of the network (see blue block in Fig. 4). 


4 Experimental Setup 


In this section, the road-scene datasets and the semantic segmentation network
that have been used in this chapter are described first. The image corruptions
that are applied in both phases, training and evaluation, are then introduced.
Finally, the evaluation metrics are described.


4.1 Datasets and Semantic Segmentation Network


Our dataset splits are summarized in Table 1. For Cityscapes [COR+16], the baseline
networks are trained with the 2,975 images of the training set $\mathcal{X}^{\mathrm{train}}_{\mathrm{CS}}$. Due to
the Cityscapes test set upload restrictions, we split the official validation set into
two sets: a mini validation set $\mathcal{X}^{\mathrm{val}}_{\mathrm{CS}}$ (Lindau, 59 images) and a mini test set $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$
(Frankfurt and Münster, 441 images). The images have a resolution of 2,048 × 1,024.
The Sim KI-A dataset is an artificially generated dataset with 4,257 training ($\mathcal{X}^{\mathrm{train}}_{\mathrm{Sim}}$),
387 validation ($\mathcal{X}^{\mathrm{val}}_{\mathrm{Sim}}$), and 387 test ($\mathcal{X}^{\mathrm{test}}_{\mathrm{Sim}}$) images. The images have a resolution of
1,920 × 1,080.

In this chapter, we use the DeepLabv3+ [CBLR18] semantic segmentation network
with ResNet-101 backbone [HZRS16]. For both datasets, the baseline network,
i.e., the network without any augmentation or compression, is trained with a
crop size of 513 × 513 and a batch size of 4 on an Nvidia Tesla V100 GPU. The
class frequency-weighted cross-entropy loss $J^{\mathrm{CE}}$ (6) and stochastic gradient
descent (SGD) are used as optimization criterion and optimizer, respectively.
During training, a polynomial learning rate scheme with an initial learning
rate of 0.01 and a power of 0.9 is applied. The network is trained to convergence
for 100 epochs on the Cityscapes dataset and for 50 epochs on the Sim KI-A dataset.
For a fair comparison, all the networks are evaluated on an Intel(R) Xeon(R)
Gold 6148 CPU.
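
For reference, a polynomial learning rate scheme with the stated initial rate and power is commonly written as $\eta(t) = \eta_0 \, (1 - t/T)^{p}$. The Python sketch below assumes this common form; whether the decay is stepped per iteration or per epoch is not stated in the text, so `step` and `total_steps` are left generic.

```python
def poly_lr(step, total_steps, base_lr=0.01, power=0.9):
    """Learning rate at a given step under the assumed polynomial decay."""
    return base_lr * (1.0 - step / total_steps) ** power


# Learning rate at the start, middle, and end of a 100-step schedule.
schedule = [poly_lr(t, 100) for t in (0, 50, 100)]
```

The rate starts at the initial value of 0.01, decays almost linearly because the power is close to 1, and reaches zero at the final step.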


4.2 Image Corruptions 


The image corruptions used in this chapter are described in Table 2. These corruptions
are split into two different categories depending on their usage, i.e., either


Table 1 Details of the road-scene datasets used in the experiments. The image resolution of the
dataset images and the split into training, validation, and test sets are given

Dataset               Resolution       $\mathcal{X}^{\mathrm{train}}$   $\mathcal{X}^{\mathrm{val}}$   $\mathcal{X}^{\mathrm{test}}$
Cityscapes [COR+16]   2,048 × 1,024    2,975     59      441
Sim KI-A              1,920 × 1,080    4,257     387     387


Table 2 Types of image corruptions used in this work, arranged in two categories based
on their usage in either the training ($\mathcal{A}^{\mathrm{train}}$) or test ($\mathcal{A}^{\mathrm{test}}$) phase

Category   Corruption type
$\mathcal{A}^{\mathrm{train}}$    Auto-contrast, equalize, posterize, color, sharpness, Gaussian blur, spatter, saturation
$\mathcal{A}^{\mathrm{test}}$     Gaussian noise, shot noise, impulse noise, defocus blur, frosted glass blur, motion blur, zoom blur


Table 3 Corruptions and their parameterization used during training. A dash (-) indicates
that the corruption function is image-dependent and does not need any parameterization. An interval
[a, b] indicates that the respective parameter is a real number sampled uniformly from this interval

Corruption type   Parameterization
Auto-contrast     -
Equalize          -
Posterize         [0, 4.0]
Color             [0.25, 4.0]
Sharpness         [0.25, 4.0]
Gaussian blur     8
Spatter           {0.65, 0.5, 0.3, 0.7, 0.65}
Saturation        {1.5, 0.1}


during training or during test. The corruptions $\mathcal{A}^{\mathrm{test}}$ in the test process are adopted
from the neural network robustness benchmark$^1$ from [HD19]. The corruptions $\mathcal{A}^{\mathrm{train}}$
used in the training process are adopted following a large body of work [HMC+20,
CZM+19, TPL+19, SK19] that use these corruptions in different ways for training
with data augmentation. Table 3 gives an overview of the parameterization of each
corruption within $\mathcal{A}^{\mathrm{train}}$. For spatter corruption, the list of parameters corresponds
to the location, scale, two sigma, and threshold values, respectively. For saturation
corruption, the list of parameters corresponds to the amount of saturation and the
scale. For posterize, color, and sharpness corruptions, the parameter is sampled from
within the given interval.


Definition of severities: Various kinds of data augmentations exist, and it is rather
difficult to compare different augmentation types, although first attempts are
known [KBFs20]. Let us take an example of brightness and contrast augmentations.
We can increase or decrease the brightness and contrast values for a given input image
by manipulating the image pixel values. An increase in the brightness and an increase
in the contrast do not necessarily correspond to the same effect on the input image.
To standardize the method of measuring the strength of augmentations irrespective
of the augmentation type, the structural similarity (SSIM) metric [WBSS04] is used.
To do this, SSIM is computed between the clean input image $x$ and the augmented
image $\tilde{x}^{(b)}$. Here, $\mathrm{SSIM}(x, \tilde{x}^{(b)}) = 0$ indicates that the image $x$ and the corresponding
augmented image $\tilde{x}^{(b)}$ are completely dissimilar. Similarly, $\mathrm{SSIM}(x, \tilde{x}^{(b)}) = 1$ indicates
that the image $x$ and the corresponding augmented image $\tilde{x}^{(b)}$ are identical, i.e.,
no augmentation is applied. We define severity levels ($V$) to indicate the strength
of the augmentation. Severity level $V = 0$ indicates that no type of augmentation is
applied to the input image and severity level $V = 10$ indicates that the input image
is completely dissimilar after the augmentation. This means that for every increase
in level in $V$, the SSIM between the clean input image and the augmented image
reduces by 0.1. To control the severity of the data augmentation during training, the
parameters $(\alpha, \beta)$ of the beta distribution from which $\gamma$ is sampled are varied (see Fig. 2),
while the corruption parameters are kept constant following Table 3.
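
The relation between SSIM and severity level stated above amounts to $V = 10 \cdot (1 - \mathrm{SSIM})$. The following Python sketch maps a measured SSIM value to a severity level; rounding to the nearest level is an assumption of this sketch, since the text does not specify how intermediate SSIM values are binned, and the function name is illustrative.

```python
def severity_level(ssim):
    """Map an SSIM value in [0, 1] to a severity level V in 0..10."""
    if not 0.0 <= ssim <= 1.0:
        raise ValueError("expected an SSIM value in [0, 1]")
    return round(10 * (1.0 - ssim))  # each level corresponds to an SSIM drop of 0.1


levels = [severity_level(s) for s in (1.0, 0.7, 0.5, 0.0)]
```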


$^1$ https://github.com/hendrycks/robustness.


4.3 Metrics


Mean intersection-over-union (mIoU) between the predictions of the semantic
segmentation network and the human-annotated ground truth labels is commonly
used for evaluating semantic segmentation networks. The mIoU is defined as

$$\mathrm{mIoU} = \frac{1}{S} \sum_{s \in \mathcal{S}} \frac{\mathrm{TP}(s)}{\mathrm{TP}(s) + \mathrm{FP}(s) + \mathrm{FN}(s)} = \mathrm{mIoU}(y, \overline{y}), \tag{13}$$

where TP(s), FP(s), and FN(s) are the class-specific true positives, false positives, and
false negatives, respectively, computed between segmentation output $y$ and ground
truth one-hot encoded segmentation $\overline{y}$.
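
A direct reading of (13) can be sketched in Python as follows. The sketch works on flattened class-label masks (obtained from the posteriors via argmax) rather than on one-hot tensors, and it skips classes that occur in neither mask, which is an assumption added here to avoid a zero denominator; (13) itself averages over all S classes.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two label masks, cf. (13)."""
    ious = []
    for s in range(1, num_classes + 1):
        tp = sum(1 for p, t in zip(pred, target) if p == s and t == s)
        fp = sum(1 for p, t in zip(pred, target) if p == s and t != s)
        fn = sum(1 for p, t in zip(pred, target) if p != s and t == s)
        if tp + fp + fn > 0:  # assumption: ignore classes absent from both masks
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)


# Four pixels, two classes (made-up labels): IoU is 1/2 for class 1, 2/3 for class 2.
miou = mean_iou(pred=[1, 1, 2, 2], target=[1, 2, 2, 2], num_classes=2)
```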

Mean performance under corruption (mPC) has been introduced by [HD19]
for evaluating the robustness of neural networks under varying corruptions and varying
strengths. For this purpose, the individual augmentations $A_c \in \mathcal{A}^{\mathrm{test}}$ are further
sub-divided with respect to the strength of the augmentations. We use the augmentations
$\mathcal{A}^{\mathrm{test}}$ (see Table 2) for the computation of mPC. This is computed by

$$\mathrm{mPC} = \frac{1}{|\mathcal{A}^{\mathrm{test}}|} \sum_{c=1}^{|\mathcal{A}^{\mathrm{test}}|} \frac{1}{N_c} \sum_{V=1}^{N_c} \mathrm{mIoU}_{c,V}, \tag{14}$$

with corruption index $c$ and $N_c$ denoting the number of severity levels for a
corruption $c$. Here, $\mathrm{mIoU}_{c,V}$ denotes the mIoU (13) of the model under the corruption $c$
and severity $V$. The key factor here is choosing the severities, as these can vary across
datasets, and even models, depending on the selection criteria. In this chapter,
we use the SSIM metric [HD19] as a means of finding severity thresholds. Using this
metric allows for standardized severities across a dataset, as it is task- and
model-agnostic. Thus, different robustness improvement methods can be benchmarked and
compared easily using the mPC metric.


Relative performance under corruption (rPC) is simply the ratio of the mPC, i.e.,
the performance under the corruptions during evaluation, to the mIoU of the semantic
segmentation network on clean data, and is defined as

$$\mathrm{rPC} = \frac{\mathrm{mPC}}{\mathrm{mIoU}}. \tag{15}$$
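
The two robustness metrics (14) and (15) reduce to nested averages and a ratio, as the following Python sketch shows. The function names and the dictionary layout (corruption name mapped to a list of mIoU values, one per severity level) are illustrative assumptions, and the per-corruption scores are made-up numbers; only the final rPC example uses the baseline values reported in this chapter (mPC of 44.03% and clean mIoU of 69.78%).

```python
def mean_performance_under_corruption(miou_by_corruption):
    """Eq. (14): average over severities, then over corruption types."""
    per_corruption = [sum(v) / len(v) for v in miou_by_corruption.values()]
    return sum(per_corruption) / len(per_corruption)


def relative_performance_under_corruption(mpc, clean_miou):
    """Eq. (15): mPC relative to the mIoU on clean data."""
    return mpc / clean_miou


# Made-up mIoU values [%] per severity level for two corruption types.
scores = {"gaussian_noise": [60.0, 50.0, 40.0], "motion_blur": [66.0, 62.0]}
mpc = mean_performance_under_corruption(scores)

# rPC of the DeepLabv3+ baseline from its reported mPC and clean mIoU.
rpc_baseline = 100.0 * relative_performance_under_corruption(44.03, 69.78)
```

Averaging within each corruption first means that corruption types with more severity levels do not dominate the mPC.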


4.4 Training Framework 


For the task of achieving robust and compressed semantic segmentation networks,
one can envision various ways to approach it. An overview of the approaches
considered is given in Fig. 5. For all the reference models, we start from the
pretrained checkpoint weights of the ResNet-101 backbone for the DeepLabv3+
architecture.


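[Fig. 5 diagram. Recoverable panel titles: (A) Robustify only (trained with data augmentation, yielding a robustified network); (B) Compress only (yielding a compressed network); a sequential panel ending in "then Robustify" (trained with data augmentation); (E) Our Method: hybrid corruption-robustness focused compression (HCRC). The combined approaches yield robustified, compressed networks.]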


Fig. 5 The training approaches used in this work are depicted. In addition to the simple baselines 
(Reference A, and Reference B), we compare our HCRC approach against sequential applications 
(Reference C, and Reference D) of the individual steps in the training framework 


Reference A: In this approach, the DeepLabv3+ network with the ResNet-101
backbone is trained for improving its corruption robustness. Here, no compression
techniques are applied. The network is trained using the protocol defined in Sect. 4.1
with the total loss (5) and $\lambda = 10^{-6}$.


Reference B: Here, the DeepLabv3+ network undergoes one-shot pruning (see
Algorithm 1). First, the network is trained using the protocol defined in Sect. 4.1 with
the class frequency-weighted cross-entropy loss (6). Next, the statistics (9), (10) are
computed on the $\mathcal{X}^{\mathrm{train}}$ and $\mathcal{X}^{\mathrm{val}}$ sets and the network undergoes PTQ (see Fig. 3). No
robustness-related training is enforced.


Reference C: In this configuration, we perform sequential application of the
robustness and compression goals. In the first step, the DeepLabv3+ network is trained using the
protocol defined in Sect. 4.1 with the total loss (5) and $\lambda = 10^{-6}$ (see also Reference
A) in combination with the data augmentation strategy described in Sect. 3.2 (see
Fig. 2). Next, the network undergoes iterative pruning (see Algorithm 1) following
again the protocol defined in Sect. 4.1, this time with the class frequency-weighted
cross-entropy loss (6). Finally, statistics (9), (10) are computed on the $\mathcal{X}^{\mathrm{train}}$ and $\mathcal{X}^{\mathrm{val}}$
sets and the network undergoes PTQ (see Fig. 3).


Reference D: In this setup, the DeepLabv3+ network is first one-shot pruned
(see Algorithm 1) and then fine-tuned following the protocol defined in Sect. 4.1 using
the class frequency-weighted cross-entropy loss (6). In the next step, the network
is trained with the data augmentation strategy described in Sect. 3.2 (see Fig. 2)
following again the protocol defined in Sect. 4.1; however, this time the total loss $J$ (5)
is used as the optimization criterion. Finally, statistics (9), (10) are computed on the
$\mathcal{X}^{\mathrm{train}}$ and $\mathcal{X}^{\mathrm{val}}$ sets and the network undergoes PTQ (see Fig. 3).


5 Experimental Results and Discussion 


5.1 Ablation Studies 


In-training quantization (ITQ) vs. post-training quantization (PTQ): We hypothesize
that training the network with quantization errors is better than quantizing the
network after training.

In Table 4, we compare these two approaches of achieving quantized networks.
The baseline DeepLabv3+ network has an mIoU of 69.78% and an mPC of 44.03%.
On the one hand, we observe that the mIoU drops by 5.35% and the mPC by 2.76% (both:
absolute) after PTQ. On the other hand, the drop in mIoU is only 1.8% after ITQ,
with no change in mPC. This result supports the abovementioned hypothesis on
quantization, that in-training quantization is superior to post-training quantization.
Additionally, we increased the size of the calibration set used within PTQ by also
including $\mathcal{X}^{\mathrm{val}}$ along with $\mathcal{X}^{\mathrm{train}}$. This, however, resulted in no significant changes in
the performance of the quantized networks.


Controlled severity training: The semantic segmentation network is evaluated over 
various corruptions and various severity levels. From our initial experiments, we 
observed that the data corruptions used during the training process [HMC+20] have 
a mean severity level of V = 1. It is intuitive that a network trained on data aug- 
mentations of higher severity should be, in theory, more robust to higher severity 
corruptions during test. To study the effect of the training severity on the robustness 
of the trained semantic segmentation network, we train the DeepLabv3+ network 
with three different severities of corruption. To do this, we vary the parameters of the 


Table 4 Test set $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$ evaluation of mIoU, mPC, and rPC comparing the non-quantized
DeepLabv3+ network and the corresponding PTQ and ITQ networks that are trained to
convergence for 100 epochs. Note that the inference times are computed on the Intel(R) Xeon(R)
Gold 6148 CPU. Best numbers reported in bold

Method        mIoU [%] (13)   mPC [%] (14)   rPC [%] (15)   Time (s)
DeepLabv3+                                                  5.23
with PTQ                                                    2.66
with ITQ                                                    2.66


Table 5 Test set $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$ evaluation based on mIoU, mPC, and rPC comparing the three DeepLabv3+
networks trained on augmentations of three different severity levels. Note that the networks are not
subjected to any kind of compression. Best numbers reported in bold

Method                               mIoU [%] (13)   mPC [%] (14)   rPC [%] (15)
DeepLabv3+                           69.78           44.03          63.09
Trained with severity level V = 1    69.98           53.65          76.66
Trained with severity level V = 2    69.54           56.14          80.73
Trained with severity level V = 3    69.44           56.27          81.03
[Fig. 6 plots. Left: line plot over the corruption severity level V (x-axis) for the Baseline, Severity 1 trained, Severity 2 trained, and Severity 3 trained networks. Right: bar chart for Baseline, Severity 1, Severity 2, and Severity 3.]

Fig. 6 Test set $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$ evaluation for the DeepLabv3+ network trained with different corruption
severities. Left: The four networks are evaluated over the augmentations $\mathcal{A}^{\mathrm{test}}$ with six severity levels
(x-axis, severities V = 0, 1, ..., 5). Right: The bar chart shows the mIoU on clean data (V = 0),
as well as mPC and rPC, computed over the same six severity levels (V = 0, 1, ..., 5), for four
networks trained with severity V = 0 (baseline), 1, 2, and 3


beta distribution to increase the influence of the individual corruptions. The results
are shown in Table 5.

We generally observe that training with higher severities leads to higher robustness
in terms of the mPC. Compared to the network trained with severity level 1, the
DeepLabv3+ network trained with a severity level of 3 shows an increase of 2.62%
absolute mPC and 4.36% absolute rPC when evaluated on $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$. In Fig. 6, we show
the results of evaluating these networks on six different
severity values. The networks trained with higher severities show higher robustness,
especially when evaluated on higher severities (V > 3). Training with a higher severity
(V > 4) did not show any further improvements. A drop in the mIoU indicates
a certain trade-off between an increase in the generalization capability of the network
(to unseen corruptions) and a decrease in its performance on the clean (or vanilla)
input.

Table 6 Test set $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$ evaluation based on mIoU, mPC, and rPC comparing the one-shot and
iterative pruning approaches for the DeepLabv3+ network. All the networks are trained with
augmentations of severity level 2. A Q = 8 bits quantization is applied to all the HCRC trainings.
Best numbers reported in bold

Method                         Pruning ratio [%]   mIoU [%] (13)   mPC [%] (14)   rPC [%] (15)
DeepLabv3+                     0                   69.78           44.03          63.09
HCRC with one-shot pruning     30                  66.06           50.80          76.89
HCRC with iterative pruning    30                  68.12           52.50          77.07
HCRC with one-shot pruning     50                  64.58           49.45          76.57
HCRC with iterative pruning    50                  66.84           51.95          77.72


Sensitivity of the pruning algorithm: We perform an ablation study to analyze
the effect of the types of pruning methodology within our HCRC approach. To this
end, we train the DeepLabv3+ network in a combined fashion (see Sect. 3.4) with
two different types of pruning, namely, one-shot pruning and iterative pruning. In
Table 6, we provide the evaluation results of this study over two different pruning
ratios (30% and 50%). A pruning ratio of 30% indicates that 30% of the prunable
weights are removed from the network, while 70% are remaining. For quantization,
within all our experiments, we have used a strong quantization of Q = 8 bits.

For a 30% pruning ratio, we observe that the iterative pruning method shows an
(absolute) increase in mIoU (2.06%), mPC (1.7%), and rPC (0.18%), when compared
to the one-shot pruning method, evaluated on $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$. For a 50% pruning ratio, we
observe similar (absolute) improvements for iterative pruning in its mIoU (2.26%),
mPC (2.5%), and rPC (1.15%), computed over $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$.


5.2 Comparison With Reference Baselines 


In this section, we compare our HCRC method (with iterative pruning) with the ref- 
erence methods (see Sect. 4.4). In particular, we compare our HCRC method against 
Reference C and Reference D, which also aim to achieve robust and compressed 
segmentation networks. 

For the Cityscapes dataset, the results of the evaluation are shown in Table 7. 
We observe that our HCRC method outperforms all the relevant reference methods 
for both pruning ratios. The Reference A network with a pruning ratio of 0% shows
an improvement of 11.62% absolute mPC over the DeepLabv3+ baseline network
with a slight improvement in the mIoU. For 30% pruning, the HCRC shows significant
improvements over the reference methods B, C, and D. The HCRC method
shows an improvement of 3.29% absolute mIoU and 9.14% absolute mPC over the
best reference (Reference D). Additionally, HCRC improves the robustness of the
DeepLabv3+ network by 8.47% absolute mPC with a 77.67% reduction in the

Table 7 Test set $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$ evaluation comparing HCRC to reference methods A-D. A quantization
with Q = 8 bits is applied to all trainings where compression is applied. Best numbers
reported in bold

Pruning ratio [%]   Method        mIoU [%] (13)   mPC [%] (14)   rPC [%] (15)   Model size (MB)
0                   DeepLabv3+    69.78           44.03          63.10          237.38
                    Reference A   69.98           53.65          76.67          237.38
30                  Reference B   62.56           35.79          57.21          52.33
                    Reference C   64.28           37.13          57.76          52.33
                    Reference D   64.83           43.36          66.88          52.33
                    HCRC (ours)   68.12           52.50          77.07          52.33
50                  Reference B   64.48           37.50          58.16          47.47
                    Reference C   66.05           36.52          55.29          47.47
                    Reference D   66.03           48.52          73.48          47.47
                    HCRC (ours)   66.16           51.95          77.72          47.47


model size. For a 50% pruning ratio, we observe similar improvements in HCRC over
the reference methods B, C, and D. The HCRC method shows an improvement in
mIoU (2.13%) and mPC (2.91%) when evaluated on $\mathcal{X}^{\mathrm{test}}_{\mathrm{CS}}$ and when compared to the
best reference (Reference D). Overall, HCRC with a pruning ratio of 50% improves the
robustness of the DeepLabv3+ network by 7.92% absolute mPC with an almost
80% reduction in the model size.

For the Sim KI-A dataset, we similarly observe that our HCRC method outperforms 
all the relevant reference baselines for both pruning ratios, see Table 8. The 
reference A network with a pruning ratio of 0% shows an improvement of 12.98% 
absolute mPC over the DeepLabv3+ baseline network with a slight improvement 
in the mIoU. For 30% pruning, HCRC shows significant improvements over the 
reference methods B, C, and D. The HCRC method shows an improvement of 2.33% 
absolute mIoU and 5.09% absolute mPC over the best reference (reference D). For 
a 50% pruning ratio, we observe similar improvements of HCRC over the reference 
methods B, C, and D. The HCRC method shows an improvement in mIoU (1.04%) 
and mPC (3.68%) when evaluated on the test set and compared to the best reference 
(reference D). Overall, HCRC with a pruning ratio of 50% improves the robustness of 
the DeepLabv3+ network by 7.60% absolute mPC with an almost 80% reduction 
in the model size. 
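
Several of the Sim KI-A figures quoted above can be re-derived from the entries of Table 8. The following Python sketch is illustrative only: it assumes that rPC is the ratio of mPC to the clean mIoU, which is consistent with the tabulated values but is not taken from the defining equations (13)-(15), and all function and variable names are hypothetical.

```python
# Values copied from Table 8 (Sim KI-A). All names below are illustrative.
BASELINE = {"miou": 77.57, "mpc": 54.42, "size_mb": 237.38}  # DeepLabv3+, 0% pruning
REF_D_30 = {"miou": 73.90, "mpc": 58.70, "size_mb": 52.33}   # reference D, 30% pruning
HCRC_30 = {"miou": 76.23, "mpc": 63.79, "size_mb": 52.33}    # HCRC, 30% pruning
HCRC_50 = {"miou": 76.12, "mpc": 62.02, "size_mb": 47.47}    # HCRC, 50% pruning


def relative_pc(row):
    """Assumed definition: rPC = mPC / clean mIoU, expressed in percent."""
    return 100.0 * row["mpc"] / row["miou"]


def absolute_gain(row, reference, key):
    """Absolute difference of one metric, in percentage points."""
    return row[key] - reference[key]


def size_reduction(row, reference):
    """Relative reduction of the model size with respect to a reference, in percent."""
    return 100.0 * (1.0 - row["size_mb"] / reference["size_mb"])


if __name__ == "__main__":
    print(f"rPC of HCRC at 30% pruning: {relative_pc(HCRC_30):.2f} %")
    print(f"mIoU gain over reference D at 30%: {absolute_gain(HCRC_30, REF_D_30, 'miou'):.2f}")
    print(f"mPC gain over reference D at 30%: {absolute_gain(HCRC_30, REF_D_30, 'mpc'):.2f}")
    print(f"mPC gain over DeepLabv3+ at 50%: {absolute_gain(HCRC_50, BASELINE, 'mpc'):.2f}")
    print(f"Size reduction at 50%: {size_reduction(HCRC_50, BASELINE):.1f} %")
```

Because the table entries are rounded to two decimals, values recomputed this way may differ from the printed ones in the last digit.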

Interestingly, the clean performance of our compressed HCRC network is nearly 
the same as that of the uncompressed DeepLabv3+ baseline, albeit with much improved 
robustness. We also show qualitative results in Fig. 7 for impulse noise of severity 
level V = 3, where we observe a significant improvement over the simpler reference 
B baseline. In summary, our proposed HCRC approach to co-optimize for corruption 
robustness and model compression outperforms all considered reference baselines and 
produces a network that is heavily compressed and robust to unseen and commonly 
occurring image corruptions. 
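
For readers unfamiliar with the corruption type shown in the qualitative example, impulse (salt-and-pepper) noise can be sketched as follows. This is a minimal NumPy illustration and not the benchmark implementation used in this chapter: the mapping from severity level to the fraction of corrupted values (`AMOUNT_BY_SEVERITY`) is a hypothetical placeholder, as is the function name `impulse_noise`.

```python
import numpy as np

# Hypothetical mapping from severity level (1..5) to the fraction of corrupted values.
AMOUNT_BY_SEVERITY = {1: 0.03, 2: 0.06, 3: 0.09, 4: 0.17, 5: 0.27}


def impulse_noise(image, severity=3, seed=0):
    """Return a copy of a uint8 image with salt-and-pepper noise applied.

    Each array element (every channel of every pixel) is independently set
    to 0 ("pepper") or 255 ("salt") with a total probability of `amount`.
    """
    rng = np.random.default_rng(seed)
    amount = AMOUNT_BY_SEVERITY[severity]
    noisy = image.copy()
    # One uniform sample in [0, 1) per array element.
    u = rng.random(image.shape)
    noisy[u < amount / 2] = 0                        # pepper
    noisy[(u >= amount / 2) & (u < amount)] = 255    # salt
    return noisy


if __name__ == "__main__":
    clean = np.full((64, 64, 3), 128, dtype=np.uint8)
    corrupted = impulse_noise(clean, severity=3)
    print("fraction of changed values:", np.mean(corrupted != clean))
```

A corrupted image produced this way would be passed to the segmentation network in place of the clean input, and the resulting mIoU, averaged over corruption types and severity levels, would contribute to the mPC.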
Table 8 Test set evaluation on the Sim KI-A dataset comparing HCRC to reference methods A-D. 
Quantization with Q = 8 bits is applied to all trainings where compression is applied. Best numbers 
reported in bold 

Pruning ratio [%]   Method        mIoU [%] (13)   mPC [%] (14)   rPC [%] (15)   Model size (MB) 
0                   DeepLabv3+    77.57           54.42          70.16          237.38 
0                   Reference A   77.05           67.40          87.48          237.38 
30                  Reference B   74.19           49.78          67.10          52.33 
30                  Reference C   72.44           51.56          71.17          52.33 
30                  Reference D   73.90           58.70          79.43          52.33 
30                  HCRC (ours)   76.23           63.79          83.68          52.33 
50                  Reference B   75.65           47.60          62.92          47.47 
50                  Reference C   75.08           47.82          63.69          47.47 
50                  Reference D   74.38           58.34          78.43          47.47 
50                  HCRC (ours)   76.12           62.02          81.48          47.47 

[Figure 7 panels: Clean Input, Clean Prediction, Ground Truth, Corrupted Prediction, Corrupted Input] 

Fig. 7 Example segmentations on the Cityscapes dataset. We show a snippet where the 
differences in the robustness of the compressed networks are more pronounced. We observe that 
our compressed HCRC network has superior robustness compared to the DeepLabv3+ baseline and 
the compressed network of reference B, in this example, for impulse noise corruption 


6 Conclusions 


In this chapter, we introduce hybrid corruption-robustness focused compression 
(HCRC), an approach to jointly optimize a neural network for network compression 
and for robustness to commonly observed corruptions such as noise and blurring 
artifacts. For this study, we consider the task of semantic segmentation for 
automated driving and look at the interactions between robustness and compression 
of networks. HCRC improves the robustness of the DeepLabv3+ network by 8.47% 
absolute mean performance under corruption (mPC) on the Cityscapes dataset and 
7.60% absolute mPC on the Sim KI-A dataset and generalizes even to augmentations 
not seen by the network in the training process. This is achieved with only minor 
degradations on undisturbed data. Our approach is evaluated over two strong 
compression ratios and consistently outperforms all considered baseline approaches. 
Additionally, we perform extensive ablation studies to further leverage and extend 
existing state-of-the-art methods. 